Publication View

Abstract Pivoted Document Length Normalization (2008)

Abstract
Automatic information retrieval systems have to deal with documents of varying lengths in a text collection. Document length normalization is used to fairly retrieve documents of all lengths. In this study, we observe that a normalization scheme that retrieves documents of all lengths with similar chances as their likelihood of relevance will outperform another scheme which retrieves documents with chances very di erent from their likelihood of relevance. We show that the retrieval probabilities for a particular normalization method deviate systematically from the relevance probabilities across di erent collections. We present pivoted normalization, a technique that can be used to modify any normalization function thereby reducing the gap between the relevance and the retrieval probabilities. Training pivoted normalization on one collection, we can successfully use it on other (new) text collections, yielding a robust, collection independent normalization technique. We use the idea of pivoting with the well known cosine normalization function. We point out some shortcomings of the cosine function and present two new normalization functions | pivoted unique normalization and pivoted byte size normalization. 1

Publication details
Download http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.8595
Source http://ir.iit.edu/~dagr/cs529/files/handouts/singhal96pivoted.pdf
Contributors CiteSeerX
Repository CiteSeerX - Scientific Literature Digital Library and Search Engine (United States)
Type text
Language English
Relation 10.1.1.101.9086, 10.1.1.32.9922, 10.1.1.98.7863, 10.1.1.46.7245, 10.1.1.50.3569