Stemming algorithms for various European languages


 

Links to resources

Snowball main page
English (porter)
English (porter2)
Romance stemmers:
French
Spanish
Portuguese
Italian
Germanic stemmers:
German
Dutch
Scandinavian stemmers:
Swedish
Norwegian
Danish
Russian
Finnish


We present stemming algorithms, and the corresponding Snowball stemmers, for English, for Russian, for the Romance languages French, Spanish, Portuguese and Italian, for German and Dutch, for Swedish, Norwegian (Bokmål) and Danish, and for Finnish.

Note that by i-suffix we mean inflexional suffix, and by d-suffix, derivational suffix (*).

There are two English stemmers: the original Porter stemmer, and an improved stemmer which has been called Porter2. Read the accounts of them to learn a bit more about using Snowball.

The early Lovins stemmer for English is also available. We also have Snowball stemmers for the Schinke/Willett Latin algorithm and the Kraaij/Pohlmann Dutch algorithm. Let us know if you would like to see them.

Each formal algorithm should be compared with the corresponding Snowball program.

Surprisingly, among the Indo-European languages (*), the French stemmer turns out to be the most complicated, whereas the Russian stemmer, despite its large number of suffixes, is very simple. It is also interesting that English, with its minimal use of i-suffixes, has such a complex stemmer. This is partly due to the delicate nature of i-suffix removal (for example, undoubling the p after removing ing from hopping), and partly to the wealth of forms of d-suffixes, deriving as they do from the mixed Romance and Germanic ancestry of the language.
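
To give a feel for why i-suffix removal is delicate, here is a minimal Python sketch of the hopping example. It is not the Porter or Porter2 algorithm. The function name strip_ing, the vowel test, and the KEEP_DOUBLE exception set are assumptions made for illustration only.

```python
VOWELS = set("aeiou")

# Doubled letters left alone in this sketch ("falling", "hissing", "buzzing").
# This exception set is an illustrative assumption, not a quoted rule.
KEEP_DOUBLE = set("lsz")


def strip_ing(word: str) -> str:
    """Remove a trailing 'ing' and undouble a final doubled consonant."""
    if not word.endswith("ing"):
        return word
    stem = word[:-3]
    # Only strip when the remaining stem contains a vowel, so that
    # short words such as "sing" or "ring" are left intact.
    if not any(ch in VOWELS for ch in stem):
        return word
    # Undouble a final doubled consonant: "hopp" -> "hop".
    if (
        len(stem) >= 2
        and stem[-1] == stem[-2]
        and stem[-1] not in VOWELS
        and stem[-1] not in KEEP_DOUBLE
    ):
        stem = stem[:-1]
    return stem


if __name__ == "__main__":
    for w in ("hopping", "falling", "sing", "hoping"):
        print(w, "->", strip_ing(w))
```

Even this toy version needs two guards (the vowel check and the list of doubles to keep), and it still conflates hoping with hopping, since both come out as hop. The real stemmers need further rules to deal with cases like this, which is part of what makes the English stemmer complex.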