Publication View

Automatic word lemmatization (2008)

Abstract
This paper addresses the problem of automatic word lemmatization using machine learning techniques. We illustrate how sequential modeling can be used to improve the classification results, in particular by modeling sub-problems that mostly have fewer than 10 class values, instead of addressing all 156 class values in one problem. We independently induced two models for automatic lemmatization of words, each built on one of two complementary data representations: (1) a set of if-then classification rules and (2) the naive Bayes classifier. Model induction was based on a set of hand-labeled words of the form (word, lemma) and some unlabeled data. We used data for the Slovenian language as an example problem, but the approach can be applied to any natural language for which such data is available. The labeled data we used is part of a larger dataset containing the same kind of information for different European languages. The data representation was based on two independent feature sets describing the same examples. The first feature set uses the letters in the words to give a structured representation of words, to which a “classical” learning algorithm is applied. In our case we used a classification rules algorithm that has been shown to work well on different machine learning problems. The second feature set uses the unlabeled data to obtain the context of the words, representing each example with a set of short documents (contexts). Here we applied the naive Bayes classifier directly to text data, an approach shown to work well in document classification. The experimental results show that both approaches perform better than a simple majority classifier.
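The following is a minimal sketch of how lemmatization can be framed as a classification problem over (word, lemma) pairs. It is not the authors' implementation. It assumes that each class value is a suffix transformation (the ending to strip and the ending to add), which the abstract does not state explicitly, and it uses invented English toy pairs rather than the Slovenian data. The helper names `transformation`, `apply_transformation`, and `suffix_features` are hypothetical. The sketch covers only the letter-based representation and the majority-classifier baseline, not the rule learner or the naive Bayes model.

```python
from collections import Counter


def transformation(word, lemma):
    """Assumed class label: (ending to strip, ending to add) mapping word to lemma."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return (word[i:], lemma[i:])


def apply_transformation(word, label):
    """Apply a class label to a word; leave the word unchanged if it does not fit."""
    strip, add = label
    if strip and not word.endswith(strip):
        return word
    stem = word[:len(word) - len(strip)] if strip else word
    return stem + add


def suffix_features(word, max_len=3):
    """Letter-based features: the last 1..max_len letters of the word."""
    return {f"last{k}": word[-k:] for k in range(1, max_len + 1) if len(word) >= k}


# Toy hand-labeled pairs, invented for illustration only.
pairs = [
    ("walked", "walk"),
    ("talked", "talk"),
    ("cats", "cat"),
    ("jumped", "jump"),
    ("dogs", "dog"),
]

# Turn each (word, lemma) pair into a class label.
labels = [transformation(word, lemma) for word, lemma in pairs]

# Majority-classifier baseline: always predict the most frequent class label.
majority_label = Counter(labels).most_common(1)[0][0]

if __name__ == "__main__":
    print(majority_label)
    print(apply_transformation("played", majority_label))
    print(suffix_features("cats"))
```

A learner such as a rule inducer would take the output of `suffix_features` as input and predict one of the transformation labels; the majority baseline ignores the features entirely, which is why it serves as the reference point for comparison.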

Publication details
Download http://citeseerx.ist.psu.edu/viewdoc/summary?doi=?doi=10.1.1.113.8678
Source http://nl.ijs.si/isjt02/zbornik/sdjt02-26mladenic.pdf
Contributors CiteSeerX
Repository CiteSeerX - Scientific Literature Digital Library and Search Engine (United States)
Type text
Language English
Relation 10.1.1.114.9164, 10.1.1.65.9324, 10.1.1.28.8552, 10.1.1.31.1664