Publication View

Text Classification using String Kernels (2008)

Abstract
We propose a novel approach for categorizing text documents based on the use of a special kernel. The kernel is an inner product in the feature space generated by all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by anexponentially decaying factor of their full length in the text, hence emphasising those occurrences that are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be e ciently evaluated by a dynamic programming technique. Experimental comparisons of the performance of the kernel compared with a standard word feature space kernel Joachims (1998) show positive results on modestly sized datasets. The case of contiguous subsequences is also considered for comparison with the subsequences kernel with di erent decay factors. For larger documents and datasets the paper introduces an approximation technique that is shown to deliver good approximations e ciently for large datasets.

Publication details
Download http://citeseerx.ist.psu.edu/viewdoc/summary?doi=?doi=10.1.1.130.2853
Source http://www.site.uottawa.ca/~stan/csi5387/string.pdf
Contributors CiteSeerX
Repository CiteSeerX - Scientific Literature Digital Library and Search Engine (United States)
Keywords Kernels and Support Vector Machines, String Subsequence Kernel, Approximating Kernels, Text Classi cation 1
Type text
Language English
Relation 10.1.1.117.4939, 10.1.1.103.1189, 10.1.1.40.7894, 10.1.1.21.2820, 10.1.1.23.6757, 10.1.1.40.4778, 10.1.1.43.3153, 10.1.1.127.1187, 10.1.1.122.7088, 10.1.1.19.8980, 10.1.1.72.604, 10.1.1.19.8258, 10.1.1.126.4610, 10.1.1.90.7556, 10.1.1.86.7384, 10.1.1.7.3622, 10.1.1.58.948, 10.1.1.71.5318, 10.1.1.36.581, 10.1.1.59.7117, 10.1.1.125.3532, 10.1.1.10.4861, 10.1.1.127.7292, 10.1.1.70.9926, 10.1.1.87.3724, 10.1.1.9.3595, 10.1.1.10.8022, 10.1.1.106.8299, 10.1.1.131.8163, 10.1.1.10.7518, 10.1.1.111.7226, 10.1.1.68.5164, 10.1.1.62.5284, 10.1.1.135.1963, 10.1.1.87.8154, 10.1.1.100.3202, 10.1.1.12.7001, 10.1.1.132.692, 10.1.1.11.1562, 10.1.1.11.2318, 10.1.1.103.7850, 10.1.1.104.743, 10.1.1.73.2408, 10.1.1.87.4008, 10.1.1.59.8392, 10.1.1.65.466, 10.1.1.73.7683, 10.1.1.79.8057, 10.1.1.10.8826, 10.1.1.75.4725, 10.1.1.83.1999, 10.1.1.101.711, 10.1.1.87.7921, 10.1.1.59.7728, 10.1.1.6.797, 10.1.1.69.848, 10.1.1.72.3845, 10.1.1.74.4124, 10.1.1.81.4119, 10.1.1.98.7154, 10.1.1.19.8509, 10.1.1.3.9371, 10.1.1.6.2523, 10.1.1.60.6795, 10.1.1.8.4666, 10.1.1.130.301, 10.1.1.1.5117, 10.1.1.1.6555, 10.1.1.104.4496, 10.1.1.107.557, 10.1.1.3.6849, 10.1.1.59.7543, 10.1.1.64.1266, 10.1.1.65.2889, 10.1.1.90.6697, 10.1.1.91.2638, 10.1.1.91.9909, 10.1.1.92.8869, 10.1.1.115.2954, 10.1.1.109.5598, 10.1.1.4.7586, 10.1.1.60.7718, 10.1.1.64.7168, 10.1.1.65.5700, 10.1.1.10.3514, 10.1.1.124.8058, 10.1.1.71.9929, 10.1.1.72.5959, 10.1.1.75.7060, 10.1.1.77.2163, 10.1.1.84.9949, 10.1.1.88.7691, 10.1.1.124.6955, 10.1.1.135.938, 10.1.1.1.6152, 10.1.1.10.4050, 10.1.1.100.4685, 10.1.1.101.4635, 10.1.1.101.5586, 10.1.1.101.7774, 10.1.1.101.923, 10.1.1.101.9698, 10.1.1.102.2463, 10.1.1.103.3623, 10.1.1.104.246, 10.1.1.104.5826, 10.1.1.104.7981, 10.1.1.105.6886, 10.1.1.105.7709, 10.1.1.105.8473, 10.1.1.106.5025, 10.1.1.107.2311, 10.1.1.107.661, 10.1.1.108.7405, 10.1.1.109.2063, 10.1.1.109.2744, 10.1.1.109.5066, 10.1.1.109.5814, 10.1.1.12.821, 10.1.1.14.678, 10.1.1.16.1955, 10.1.1.2.7163, 10.1.1.126.4018, 10.1.1.23.6786, 10.1.1.3.3738, 10.1.1.3.9378, 10.1.1.4.9298, 10.1.1.58.4737, 10.1.1.59.4941, 10.1.1.59.9438, 10.1.1.60.8088, 10.1.1.61.6630, 10.1.1.62.7084, 10.1.1.62.8029, 10.1.1.63.6966, 10.1.1.67.8257, 10.1.1.128.5213, 10.1.1.114.2115, 10.1.1.69.6839, 10.1.1.101.5381, 10.1.1.70.1855, 10.1.1.70.7053, 10.1.1.71.9173, 10.1.1.73.3817, 10.1.1.73.4209, 10.1.1.73.6992, 10.1.1.73.8426, 10.1.1.73.8890, 10.1.1.74.2037, 10.1.1.74.2586, 10.1.1.74.4731, 10.1.1.75.5353, 10.1.1.76.3249, 10.1.1.77.2216, 10.1.1.77.5431, 10.1.1.77.6371, 10.1.1.78.7409, 10.1.1.78.9441, 10.1.1.79.6600, 10.1.1.79.9066, 10.1.1.117.944, 10.1.1.80.1731, 10.1.1.84.2934, 10.1.1.84.8725, 10.1.1.84.9723, 10.1.1.86.4391, 10.1.1.88.3774, 10.1.1.9.3453, 10.1.1.90.7416, 10.1.1.91.626, 10.1.1.91.6753, 10.1.1.92.483, 10.1.1.93.6756, 10.1.1.96.1138, 10.1.1.109.5748, 10.1.1.96.580, 10.1.1.96.9998, 10.1.1.97.5483, 10.1.1.97.7893, 10.1.1.98.175, 10.1.1.98.5749, 10.1.1.98.6421, 10.1.1.99.7054, 10.1.1.99.9583, 10.1.1.110.202, 10.1.1.110.5494, 10.1.1.113.5231, 10.1.1.115.2639, 10.1.1.115.4480, 10.1.1.115.5487, 10.1.1.118.8788, 10.1.1.119.8147, 10.1.1.121.9928, 10.1.1.124.3338, 10.1.1.125.2509, 10.1.1.125.3893, 10.1.1.18.4531, 10.1.1.130.8262, 10.1.1.131.2308, 10.1.1.132.6275, 10.1.1.138.5378