Topic-bridged PLSA for Cross-Domain Text Classification (2009)

Abstract
In many Web applications, such as blog classification and newsgroup classification, labeled data are in short supply. It often happens that obtaining labeled data in a new domain is expensive and time consuming, while there may be plenty of labeled data in a related but different domain. Traditional text classification approaches are not able to cope well with learning across different domains. In this paper, we propose a novel cross-domain text classification algorithm which extends the traditional probabilistic latent semantic analysis (PLSA) algorithm to integrate labeled and unlabeled data, which come from different but related domains, into a unified probabilistic model. We call this new model Topic-bridged PLSA, or TPLSA. By exploiting the common topics between two domains, we transfer knowledge across different domains through a topic-bridge to help the text classification in the target domain. A unique advantage of our method is its ability to maximally mine knowledge that can be transferred between domains, resulting in superior performance when compared to other state-of-the-art text classification approaches. Experimental evaluation on different kinds of datasets shows that our proposed algorithm can improve the performance of cross-domain text classification significantly.
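
For orientation, the traditional PLSA model that the abstract says TPLSA extends can be sketched as below. This is a minimal illustration of standard PLSA fitted by EM on a document-word count matrix. It is not the authors' TPLSA: the joint treatment of labeled source-domain and unlabeled target-domain documents is not shown, because this record does not give those details. The function names `plsa` and `log_likelihood`, the parameters, and the toy counts are assumptions made for illustration.

```python
import math
import random


def _normalise(row):
    """Scale a non-negative row to sum to 1 (uniform if it has no mass)."""
    total = sum(row)
    if total == 0.0:
        return [1.0 / len(row)] * len(row)
    return [v / total for v in row]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit plain PLSA by EM. `counts[d][w]` is the count of word w in doc d.

    Returns (p_z_d, p_w_z): P(z|d) per document and P(w|z) per topic.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    # Random positive initialisation of both conditional distributions.
    p_z_d = [_normalise([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [_normalise([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for z in range(n_topics):
                    expected = n_dw * joint[z] / total
                    # Accumulate expected counts for the M-step.
                    new_z_d[d][z] += expected
                    new_w_z[z][w] += expected
        # M-step: renormalise the expected counts.
        p_z_d = [_normalise(row) for row in new_z_d]
        p_w_z = [_normalise(row) for row in new_w_z]
    return p_z_d, p_w_z


def log_likelihood(counts, p_z_d, p_w_z):
    """Log-likelihood of the observed counts under the fitted model."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw:
                p_w_d = sum(p_z_d[d][z] * p_w_z[z][w]
                            for z in range(len(p_w_z)))
                ll += n_dw * math.log(p_w_d)
    return ll


if __name__ == "__main__":
    # Toy corpus (hypothetical): two documents per vocabulary block.
    toy = [[4, 3, 0, 0], [3, 5, 0, 0], [0, 0, 4, 2], [0, 0, 3, 5]]
    doc_topics, topic_words = plsa(toy, n_topics=2)
    print(doc_topics)
    print(log_likelihood(toy, doc_topics, topic_words))
```

In a cross-domain setting of the kind the abstract describes, the word-topic distributions P(w|z) would be the natural candidate for the shared "bridge" between domains, but how TPLSA actually ties the two domains together should be taken from the paper itself rather than from this sketch.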

Publication details
Download http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.141.831
Source http://www.cse.ust.hk/~qyang/Docs/2008/fp352-xue.pdf
Contributors CiteSeerX
Repository CiteSeerX - Scientific Literature Digital Library and Search Engine (United States)
Keywords Cross-Domain, Text Classification
Type text
Language English
Relation 10.1.1.133.4884, 10.1.1.11.6124, 10.1.1.32.9956, 10.1.1.1.5684, 10.1.1.20.9305, 10.1.1.109.2516, 10.1.1.33.1187, 10.1.1.33.6843, 10.1.1.33.3342, 10.1.1.103.1693, 10.1.1.7.9416, 10.1.1.33.906, 10.1.1.1.6557, 10.1.1.94.594, 10.1.1.114.6412, 10.1.1.76.8036, 10.1.1.76.7577, 10.1.1.103.8462