| Scaling EM (Expectation-Maximization) Clustering to Large Databases (1999) | |||||||||||||||||
Abstract | |||||||||||||||||
| : Practical statistical data clustering algorithms require multiple data scans to converge. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework requiring at most one scan of the database, and apply it to the Expectation-Maximization (EM) algorithm. Unlike distance-based or hard membership algorithms (such as k-Means) EM is known to be an appropriate optimization algorithm for constructing proper statistical models of the data. It also easily accommodates categorical and continuous data fields. It admits varying degrees of data membership in multiple clusters. Our scalable method is based on identifying regions of the data that are compressible and regions that must be maintained in memory. The approach operates within the confines of a limited memory buffer. Data resolution is preserved to the extent possible based upon the size of the memory buffer and the fit of the current clustering model to the data. We extend the framework ... | |||||||||||||||||
Publication details | |||||||||||||||||
| |||||||||||||||||