
The Distribution of N-Grams (2000)
  • Egghe, Leo [83]

Abstract
N-grams are generalized words consisting of N consecutive symbols, as they are used in a text. This paper determines the rank-frequency distribution for redundant N-grams. For entire texts this is known to be Zipf's law (i.e., an inverse power law). For N-grams, however, we show that the rank (r)-frequency distribution is [formula], where $\psi_N$ is the inverse function of $f_N(x) = x \ln^{N-1} x$. Here we assume that the rank-frequency distribution of the symbols follows Zipf's law with exponent $\beta$.
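
The inverse function named in the abstract has no elementary closed form for N > 1, so it may help to see how it could be evaluated numerically. The following is a minimal sketch only: the function names `f_n` and `psi_n`, the restriction to x >= 1 (where f_N is increasing), and the use of bisection are assumptions made for illustration and are not taken from the paper. The rank-frequency formula itself is not reproduced in this record, so it is not computed here.

```python
import math


def f_n(x: float, n: int) -> float:
    """f_N(x) = x * (ln x)^(N-1), the function named in the abstract."""
    return x * math.log(x) ** (n - 1)


def psi_n(y: float, n: int, tol: float = 1e-12) -> float:
    """Approximate the inverse of f_N on x >= 1 by bisection.

    Illustrative assumption: the paper may define or evaluate
    the inverse differently.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    if y < 0:
        raise ValueError("y must be non-negative")
    if n == 1:
        return y  # f_1(x) = x, so the inverse is the identity

    # f_N(1) = 0 for N >= 2; grow the upper bound until it brackets y.
    lo, hi = 1.0, 2.0
    while f_n(hi, n) < y:
        hi *= 2.0

    # Halve the bracket until its relative width is below tol.
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if f_n(mid, n) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A round trip such as `psi_n(f_n(10.0, 3), 3)` should return a value close to 10, which is a quick way to check the sketch against the definition.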

Publication details
Download http://hdl.handle.net/1942/788
Publisher Springer
Repository Document Server@UHasselt (Belgium)
Type Article
Language English
Relation http://dx.doi.org/10.1023/A:1005634925734

Cited publications (2)
On the law of Zipf-Mandelbrot for multi-word phrases (1999)
  • Egghe, Leo [83]
Duality in information retrieval and the hypergeometric distribution (1997)