Pivoted Document Length Normalization (2008)
Abstract
Automatic information retrieval systems have to deal with documents of varying lengths in a text collection. Document length normalization is used to fairly retrieve documents of all lengths. In this study, we observe that a normalization scheme that retrieves documents of all lengths with similar chances as their likelihood of relevance will outperform another scheme which retrieves documents with chances very different from their likelihood of relevance. We show that the retrieval probabilities for a particular normalization method deviate systematically from the relevance probabilities across different collections. We present pivoted normalization, a technique that can be used to modify any normalization function, thereby reducing the gap between the relevance and the retrieval probabilities. Training pivoted normalization on one collection, we can successfully use it on other (new) text collections, yielding a robust, collection independent normalization technique. We use the idea of pivoting with the well known cosine normalization function. We point out some shortcomings of the cosine function and present two new normalization functions: pivoted unique normalization and pivoted byte size normalization.
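The abstract does not spell out the pivoting formula. As a rough illustration, the sketch below uses the linear form usually associated with this technique: the original normalization factor is tilted around a pivot point, so documents whose factor lies above the pivot are penalized less and those below it are penalized more. The function name `pivoted_norm` and the pivot and slope values are assumptions chosen for illustration, not values taken from the paper.

```python
def pivoted_norm(old_norm: float, pivot: float, slope: float) -> float:
    """Tilt an existing normalization factor around a pivot point.

    old_norm: factor from the original scheme (e.g. a cosine norm).
    pivot:    value at which old and new factors coincide (assumed here
              to be something like the collection average of old_norm).
    slope:    tilt between 0 and 1; slope == 1 leaves old_norm unchanged.
    """
    return (1.0 - slope) * pivot + slope * old_norm


if __name__ == "__main__":
    # Hypothetical normalization factors for a short, average, and long document.
    for old in (10.0, 20.0, 40.0):
        new = pivoted_norm(old, pivot=20.0, slope=0.25)
        print(f"old={old:5.1f}  pivoted={new:5.1f}")
```

With these illustrative values, the short document's factor rises from 10 to 17.5 and the long document's falls from 40 to 25, while the document sitting at the pivot is unchanged. Since a term weight is divided by this factor, the effect is to lower the scores of short documents and raise those of long ones relative to the original scheme.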