lucene-java-user mailing list archives

From Tom Burton-West <tburt...@umich.edu>
Subject Re: Which stemmer?
Date Thu, 15 Nov 2012 18:06:17 GMT
I agree with Erick that you probably need to give your client a list of
concrete examples, and perhaps to explain the trade-offs.

All stemmers both overstem and understem. Understemming means that some
forms of a word won’t get searched. For example, without stemming, searching
for “dogs” would not retrieve documents containing the word “dog”.
Overstemming is the opposite problem: unrelated words are reduced to the
same stem, so a search for one also retrieves documents about the other.
Generally there is a precision/recall tradeoff where reducing understemming
increases overstemming. The problem with aggressive stemmers like the
Porter stemmer is that they overstem.

The original Porter stemmer, for example, would stem “organization” and
“organic” both to “organ”, and “generalization”, “generous” and “generic”
to “gener”.
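
One rough way to put together such a list of examples is to run a sample
vocabulary through each candidate stemmer and look at which words end up
sharing a stem. The Python sketch below does this with two toy
suffix-strippers. The names light_stem, aggressive_stem and group_by_stem
are made up for illustration, and neither function is the Porter or Kstem
algorithm; in practice you would plug in the real stemmers being evaluated.

```python
from collections import defaultdict


def light_stem(word):
    """Toy light stemmer (an assumption, not a real algorithm):
    strips only a trailing plural 's'."""
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith(("ss", "us")):
        return w[:-1]
    return w


def aggressive_stem(word):
    """Toy aggressive stemmer (an assumption, not Porter):
    strips the first matching suffix from a fixed list, longest first."""
    w = word.lower()
    for suffix in ("alization", "ization", "ous", "ic", "s"):
        stem = w[: len(w) - len(suffix)]
        if w.endswith(suffix) and len(stem) >= 3:
            return stem
    return w


def group_by_stem(words, stemmer):
    """Map each stem to the sorted list of words that reduce to it.
    Words sharing a stem would match each other at search time."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer(word)].append(word)
    return {stem: sorted(ws) for stem, ws in groups.items()}


words = ["dog", "dogs", "organization", "organic",
         "generalization", "generous", "generic"]

light_groups = group_by_stem(words, light_stem)
aggressive_groups = group_by_stem(words, aggressive_stem)

for name, groups in (("light", light_groups), ("aggressive", aggressive_groups)):
    print(name)
    for stem, members in sorted(groups.items()):
        if len(members) > 1:  # only conflated words are interesting
            print("  ", stem, "<-", ", ".join(members))
```

With this sample the light version only conflates "dog" and "dogs", while
the aggressive version also merges the "organ" and "gener" groups. Groups
like those are the cases to show the client when explaining the trade-off.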

For background on the Porter stemmers and lots of examples see these pages:

http://snowball.tartarus.org/algorithms/porter/stemmer.html

http://snowball.tartarus.org/algorithms/english/stemmer.html

This paper on the Kstem stemmer lists cases where the Porter stemmer
understems or overstems and explains the logic of Kstem: "Viewing
Morphology as an Inference Process" (Krovetz, R., Proceedings of the
Sixteenth Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, 191-203, 1993).

http://ciir.cs.umass.edu/pubfiles/ir-35.pdf

Tom

http://www.hathitrust.org/blogs/large-scale-search
