Using Clustering to Boost Text Classification (2001)
Abstract
In recent years we have seen tremendous growth in the number of text document collections available on the Internet. Automatic text categorization, the process of assigning unseen documents to user-defined categories, is an important task that can help in organizing and querying such collections. In this article we consider the problem of classifying online papers from a specific journal in the geological sciences over a set of expert-defined categories. We evaluate two general strategies and several variants thereof. The first strategy is based on Naive Bayes, a popular text classification algorithm. The second strategy is based on Principal Direction Divisive Partitioning, an unsupervised document clustering algorithm. While the performance of both approaches is quite good, some of the new variants that we propose, including one that combines the two approaches, yield even better overall performance.
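For readers unfamiliar with the clustering method named in the abstract, the following is a minimal sketch of a single Principal Direction Divisive Partitioning split. It is not the authors' implementation: the function name `pddp_split`, the documents-as-rows layout, and the toy matrix are assumptions made for illustration. The idea shown is to center the document vectors, find their leading principal direction, and divide the documents by the sign of their projection onto it.

```python
import numpy as np


def pddp_split(doc_term_matrix):
    """One PDDP-style split (illustrative sketch, hypothetical helper).

    doc_term_matrix: array of shape (n_documents, n_terms).
    Returns a boolean mask assigning each document to one of two clusters.
    """
    X = np.asarray(doc_term_matrix, dtype=float)
    # Subtract the centroid so the split is relative to the mean document.
    centered = X - X.mean(axis=0)
    # The first right singular vector is the principal direction in term space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projection = centered @ vt[0]
    # Documents are divided by the sign of their projection.
    return projection <= 0


# Toy data (assumed): two documents dominated by term 0, two by term 1.
docs = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
mask = pddp_split(docs)
```

Applying the split recursively to the resulting clusters would produce a hierarchy. Which side of the split each group lands on depends on the arbitrary sign of the singular vector, so only the grouping itself is meaningful.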
Publication details