Almost-Constant-Time Clustering of Arbitrary Corpus Subsets (1997)

Abstract
Methods exist for constant-time clustering of corpus subsets selected via Scatter/Gather browsing [3]. In this paper we expand on those techniques, giving an algorithm for almost-constant-time clustering of arbitrary corpus subsets. This algorithm is never slower than clustering the document set from scratch, and for medium-sized and large sets it is significantly faster. This algorithm is useful for clustering arbitrary subsets of large corpora (obtained, for instance, by a Boolean search) quickly enough to be useful in an interactive setting.

1 Introduction

Document clustering has emerged as an important tool for the presentation and navigation of document collections. For example, the Scatter/Gather browsing paradigm clusters documents into topic-coherent groups and presents descriptive textual summaries to the user [2]. Informed by the summaries, the user may select clusters, thereby forming a subcollection, for iterative examination. The clustering and reclustering is done ...

Publication details
Download http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.7290
Source http://www-cs-students.stanford.edu/~csilvers/papers/sm-sigir.ps
Contributors CiteSeerX
Repository CiteSeerX - Scientific Literature Digital Library and Search Engine (United States)
Type text
Language English
Relation 10.1.1.34.6746, 10.1.1.34.6233, 10.1.1.43.1252, 10.1.1.41.2605, 10.1.1.38.4937, 10.1.1.21.3062, 10.1.1.27.1592, 10.1.1.102.8546, 10.1.1.61.942, 10.1.1.33.1855, 10.1.1.80.8015, 10.1.1.73.5431, 10.1.1.22.9199, 10.1.1.125.7772, 10.1.1.40.7762, 10.1.1.78.9841, 10.1.1.83.8247, 10.1.1.93.1692, 10.1.1.112.3770, 10.1.1.123.5531, 10.1.1.25.6956, 10.1.1.9.6579, 10.1.1.9.9487