Publication View

Pivoted Document Length Normalization (2008)

Abstract
Automatic information retrieval systems have to deal with documents of varying lengths in a text collection. Document length normalization is used to fairly retrieve documents of all lengths. In this study, we observe that a normalization scheme that retrieves documents of all lengths with similar chances as their likelihood of relevance will outperform another scheme which retrieves documents with chances very different from their likelihood of relevance. We show that the retrieval probabilities for a particular normalization method deviate systematically from the relevance probabilities across different collections. We present pivoted normalization, a technique that can be used to modify any normalization function thereby reducing the gap between the relevance and the retrieval probabilities. Training pivoted normalization on one collection, we can successfully use it on other (new) text collections, yielding a robust, collection independent normalization technique. We use the idea of pivoting with the well known cosine normalization function. We point out some shortcomings of the cosine function and present two new normalization functions: pivoted unique normalization and pivoted byte size normalization.
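
The abstract does not state a formula, so the following is only a minimal sketch of the pivoting idea. It assumes the linear form usually associated with this technique, (1 - slope) * pivot + slope * old_norm, where old_norm is the value of any existing normalization function for a document. The function name `pivoted_norm`, the choice of the collection average as the pivot, and the slope value 0.2 are illustrative assumptions, not values taken from this record.

```python
def pivoted_norm(old_norm: float, pivot: float, slope: float) -> float:
    """Tilt an existing normalization factor around a pivot point.

    Assumed form: (1 - slope) * pivot + slope * old_norm.
    With 0 < slope < 1, documents whose old factor is below the pivot
    get a larger factor, and documents above it get a smaller one.
    """
    return (1.0 - slope) * pivot + slope * old_norm


# Hypothetical old normalization factors for three documents
# (for example, cosine norms); these numbers are made up for illustration.
old_norms = [5.0, 10.0, 20.0]

# Assumption: use the average old factor as the pivot.
pivot = sum(old_norms) / len(old_norms)

# Assumption: an illustrative slope; in practice it would be tuned on a
# training collection.
slope = 0.2

new_norms = [pivoted_norm(n, pivot, slope) for n in old_norms]

# A term weight is divided by the normalization factor, so a smaller
# factor raises a document's score and a larger factor lowers it.
for old, new in zip(old_norms, new_norms):
    print(f"old factor {old:6.2f} -> pivoted factor {new:6.2f}")
```

In this sketch a document whose old factor equals the pivot is left unchanged, while the gap between short and long documents is narrowed, which is one way to move retrieval probabilities closer to relevance probabilities as the abstract describes.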

Publication details
Download http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.128.1360
Source http://www.cs.umbc.edu/~nicholas/676/p21-singhal.pdf
Contributors CiteSeerX
Repository CiteSeerX - Scientific Literature Digital Library and Search Engine (United States)
Type text
Language English
Relation 10.1.1.101.9086, 10.1.1.32.9922, 10.1.1.98.7863, 10.1.1.46.7245, 10.1.1.50.3569