lucene-java-user mailing list archives

From Tom Burton-West <>
Subject Re: Which stemmer?
Date Thu, 15 Nov 2012 18:06:17 GMT
I agree with Erick that you probably need to give your client a list of
concrete examples, and perhaps to explain the trade-offs.

All stemmers both overstem and understem. Understemming means that some
forms of a word won’t get searched. For example, without stemming, searching
for “dogs” would not retrieve documents containing the word “dog”.
Overstemming means that unrelated words get reduced to the same stem, so a
search for one also retrieves documents about the other.
Generally there is a precision/recall tradeoff where reducing understemming
increases overstemming. The problem with aggressive stemmers like the
Porter stemmer is that they overstem.
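
To make the “dogs”/“dog” case concrete for a client, here is a minimal
sketch in Python. It is a toy, not Lucene's analysis chain: light_stem,
build_index and search are hypothetical helpers, and the "strip one
trailing s" rule is an assumption made only for this illustration.

```python
# Toy illustration only: NOT Lucene and NOT a real stemmer.

def light_stem(token):
    """Hypothetical minimal stemmer: strip one trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def identity(token):
    """No stemming at all: terms are indexed exactly as written."""
    return token


def build_index(docs, normalize):
    """Map each normalized term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(normalize(token), set()).add(doc_id)
    return index


def search(index, query, normalize):
    """Look up a single-term query after applying the same normalization."""
    return index.get(normalize(query.lower()), set())


docs = {1: "the dog barked", 2: "two dogs barked", 3: "cats sleep"}

# Without stemming, the query only matches the exact surface form.
no_stem_hits = search(build_index(docs, identity), "dogs", identity)

# With the toy stemmer, "dogs" and "dog" share the index term "dog".
stem_hits = search(build_index(docs, light_stem), "dogs", light_stem)

print(sorted(no_stem_hits), sorted(stem_hits))
```

The same normalization has to be applied at index time and at query time,
otherwise the query term and the indexed term will not line up.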

The original Porter stemmer, for example, would stem “organization” and
“organic” both to “organ”, and “generalization”, “generous”, and “generic”
to “gener”.*
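
One way to show a client what overstemming looks like is to group words by
the stem they end up with. The sketch below is in Python and is NOT the
Porter algorithm: aggressive_stem is a hypothetical suffix stripper, and
its suffix list and minimum stem length are assumptions picked only so
that it reproduces the conflations mentioned above.

```python
from collections import defaultdict

# Hypothetical suffix list, chosen only to reproduce the examples above.
SUFFIXES = ["ization", "ous", "al", "ic"]
MIN_STEM = 4  # assumed lower bound on what may remain after stripping


def aggressive_stem(word):
    """Toy stemmer: keep stripping listed suffixes while enough stem remains."""
    stem = word.lower()
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM:
                stem = stem[: -len(suffix)]
                stripped = True
                break
    return stem


def conflation_classes(words, stemmer):
    """Group words by stem; each group is searched as if it were one word."""
    classes = defaultdict(list)
    for word in words:
        classes[stemmer(word)].append(word)
    return dict(classes)


words = ["organization", "organic", "generalization", "generous", "generic"]
classes = conflation_classes(words, aggressive_stem)

for stem, members in classes.items():
    print(stem, "<-", ", ".join(members))
```

Running the same grouping over a sample of the client's own vocabulary,
with whichever real stemmers are under consideration plugged in as the
stemmer argument, gives the kind of concrete list that makes the
trade-off easier to discuss.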

For background on the Porter stemmers and lots of examples see these pages:



This paper on the Kstem stemmer lists cases where the Porter stemmer
understems or overstems and explains the logic of Kstem: "Viewing
Morphology as an Inference Process" (Krovetz, R., Proceedings of the
Sixteenth Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, 191-203, 1993).


