A new lemmatizer that handles morphological changes in pre-, in- and suffixes alike
Talk by Bart Jongejan, CST, University of Copenhagen, Tuesday, May 6, 2008, at 13.00-14.45, meeting room 7501, Forum, DSV, Kista.
In some Indo-European languages, such as English and the North Germanic languages, most words can be lemmatized by removing or replacing a suffix. In languages like German and Dutch, on the other hand, regular lemmatization often involves removing, adding or replacing other types of affixes (prefixes and infixes), and even combinations of such string operations.
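To make the contrast concrete, the following minimal sketch treats a lemmatization rule as a pattern with a single wildcard, so that one rule can change a prefix and a suffix at the same time. The rule format, the function name `apply_rule` and the example rules are illustrative assumptions, not the rule format of the lemmatizer presented in the talk.

```python
from typing import Optional


def apply_rule(word: str, pattern: str, replacement: str) -> Optional[str]:
    """Apply one affix rule; '*' stands for the unchanged stem.

    Hypothetical rule format: 'ge*t' -> '*en' strips the prefix 'ge' and
    replaces the suffix 't' with 'en'.
    """
    prefix, suffix = pattern.split("*")
    new_prefix, new_suffix = replacement.split("*")
    # The stem must be non-empty, so the word has to be longer than the affixes.
    if len(word) <= len(prefix) + len(suffix):
        return None
    if not (word.startswith(prefix) and word.endswith(suffix)):
        return None
    stem = word[len(prefix):len(word) - len(suffix)]
    return new_prefix + stem + new_suffix


# Suffix-only rule, typical for English: "walked" -> "walk"
print(apply_rule("walked", "*ed", "*"))
# Combined prefix and suffix rule, as needed for German: "gesagt" -> "sagen"
print(apply_rule("gesagt", "ge*t", "*en"))
```

A suffix-only lemmatizer could express the first rule but not the second, which is the kind of case the new lemmatizer is meant to cover.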
The rules for the new lemmatizer are created by automatic training on a large sample set of full form – lemma pairs. An attempt was made to let the rules assign a word to more than one lemma (when appropriate), but this approach had to be abandoned. The current implementation produces one lemma per word when the lemmatization rules are applied and relies on an optional built-in dictionary to produce additional correct lemmas, for known words only.
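The division of labour between rules and dictionary can be sketched as follows. This is a simplified illustration under stated assumptions: the suffix rules, the dictionary entry and the function names are hypothetical, the longest-matching-suffix strategy merely stands in for the trained rules, and letting the dictionary take precedence for known words is one possible reading of how the two components interact.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical suffix rules (old suffix, new suffix). The real rules are
# learned from pairs of full forms and lemmas and may also change prefixes
# and infixes.
RULES: List[Tuple[str, str]] = [("ies", "y"), ("s", ""), ("", "")]

# Hypothetical built-in dictionary: known full forms with all their lemmas.
DICTIONARY: Dict[str, Set[str]] = {"leaves": {"leaf", "leave"}}


def lemmatize_by_rule(word: str) -> str:
    """Return exactly one lemma, using the longest matching suffix rule."""
    for old, new in sorted(RULES, key=lambda rule: len(rule[0]), reverse=True):
        if word.endswith(old):
            return word[:len(word) - len(old)] + new
    return word


def lemmatize(word: str) -> Set[str]:
    """Known words get every dictionary lemma; other words get one rule-based lemma."""
    if word in DICTIONARY:
        return DICTIONARY[word]
    return {lemmatize_by_rule(word)}


# Unknown word: a single lemma produced by the rules.
print(sorted(lemmatize("ponies")))
# Known ambiguous word: several lemmas, available only through the dictionary.
print(sorted(lemmatize("leaves")))
```

In this sketch, ambiguity is only ever resolved through the dictionary, which mirrors the decision to keep the rules themselves to one lemma per word.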
Initial test results suggest that the new lemmatizer has a higher accuracy than the former CSTlemma software, even for languages with mainly suffix morphology, but that its errors may sometimes be “more wrong” than those made by the old CSTlemma software.