lucene-java-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Which stemmer?
Date Thu, 15 Nov 2012 23:02:05 GMT
One other factor to keep in mind is that the customer should never "look" at 
the actual stem term, such as "countri" or "gener", because it can freak 
them out a little, for no good reason. I mean, the goal of stemming is to 
determine what set of words/terms will be treated as equivalent on a query, 
and this is independent of what gets returned for a stored field. The stem 
is simply the means to THAT end.

The fact that "dog" and "dogs" are not equivalent in KStem is in fact 
disheartening, at least to me, but it may not be problematic in some use 
cases.

-- Jack Krupansky

-----Original Message----- 
From: Scott Smith
Sent: Thursday, November 15, 2012 11:57 AM
To: java-user@lucene.apache.org
Subject: RE: Which stemmer?

Thanks for the suggestions.  I think Erick is correct as well.  I'll let the 
customer decide.

Here's an updated list.  Fyi--the minStem was the English Minimal Stemmer--I 
changed the label.  Interesting to see where the minimal stemmer and porter 
agree (and KStemmer doesn't).  You may also find the "dog" examples 
interesting.  I also found the "invest*" list entertaining.

   original       porter        kstem   EngMinStem
-----------  -----------  -----------  -----------
    country      countri      country      country
  countries      countri      country      country
  country's     country'    country's     country'
        run          run          run          run
       runs          run         runs          run
    running          run      running      running
       read         read         read         read
    reading         read      reading      reading
     reader       reader       reader       reader
association       associ  association  association
  associate       associ    associate    associate
    listing         list         list      listing
      water        water        water        water
    watered        water        water      watered
       sure         sure         sure         sure
     surely         sure       surely       surely
     invest       invest       invest       invest
  investing       invest       invest    investing
investment       invest   investment   investment
investments       invest   investment   investment
    invests       invest       invest       invest
   investor     investor       invest     investor
   invester       invest       invest     invester
  investors     investor       invest     investor
  investers       invest       invest     invester
organization        organ  organization  organization
   organize        organ     organize     organize
    organic        organ      organic      organic
   generous        gener     generous     generous
    generic        gener      generic      generic
        dog          dog          dog          dog
      dog's         dog'        dog's         dog'
       dogs          dog         dogs          dog
      dogs'          dog         dogs          dog
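
One way to read the table is as equivalence classes: two words match on a 
query only if a stemmer gives them the same stem. The sketch below is an 
illustration only, not anything from Lucene. It copies a few rows from the 
table and groups the original words by stem; the helper names 
equivalence_classes and equivalent are made up for this example.

```python
from collections import defaultdict

# Rows copied from the table above: (original, porter, kstem, EngMinStem).
ROWS = [
    ("dog", "dog", "dog", "dog"),
    ("dogs", "dog", "dogs", "dog"),
    ("invest", "invest", "invest", "invest"),
    ("investing", "invest", "invest", "investing"),
    ("investment", "invest", "investment", "investment"),
    ("investor", "investor", "invest", "investor"),
    ("investors", "investor", "invest", "investor"),
    ("organization", "organ", "organization", "organization"),
    ("organic", "organ", "organic", "organic"),
]
STEMMERS = ("porter", "kstem", "EngMinStem")


def equivalence_classes(rows, stemmer):
    """Group original words by the stem that the given stemmer produced."""
    column = STEMMERS.index(stemmer) + 1
    classes = defaultdict(list)
    for row in rows:
        classes[row[column]].append(row[0])
    return dict(classes)


def equivalent(rows, stemmer, a, b):
    """Return True if words a and b share a stem under the given stemmer."""
    column = STEMMERS.index(stemmer) + 1
    stems = {row[0]: row[column] for row in rows}
    return stems[a] == stems[b]


if __name__ == "__main__":
    for name in STEMMERS:
        print(name, equivalence_classes(ROWS, name))
```

Showing a customer these groups, rather than the raw stems, may be an 
easier way to let them judge which stemmer behaves the way they expect.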

Now, if someone would answer my question on the Solr list ("Custom Solr 
Indexer/Search"), my day would be complete ;-).

Thanks for the continued help.

Scott

-----Original Message-----
From: Tom Burton-West [mailto:tburtonw@umich.edu]
Sent: Thursday, November 15, 2012 11:06 AM
To: java-user@lucene.apache.org
Subject: Re: Which stemmer?

I agree with Erick that you probably need to give your client a list of 
concrete examples, and perhaps to explain the trade-offs.

All stemmers both overstem and understem.  Understemming means that some
forms of a word won't get searched.  For example, without stemming, 
searching for "dogs" would not retrieve documents containing the word "dog".
Overstemming means that unrelated words get conflated to the same stem.
Generally there is a precision/recall tradeoff where reducing understemming 
increases overstemming.  The problem with aggressive stemmers like the 
Porter stemmer is that they overstem.
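
The tradeoff can be made concrete with a rough sketch. The stems below are 
copied from the table posted earlier in this thread. WANTED_GROUPS is purely 
an assumption for illustration: a hand-made guess at which words a searcher 
would want treated as the same. The function name stemming_errors is also 
made up. A pair counts as overstemmed when it shares a stem but not a 
wanted group, and as understemmed when it shares a wanted group but not a 
stem.

```python
from itertools import combinations

# Stems copied from the table in this thread: word -> (porter, kstem).
STEMS = {
    "organization": ("organ", "organization"),
    "organize": ("organ", "organize"),
    "organic": ("organ", "organic"),
    "generous": ("gener", "generous"),
    "generic": ("gener", "generic"),
    "dog": ("dog", "dog"),
    "dogs": ("dog", "dogs"),
}
PORTER, KSTEM = 0, 1

# ASSUMPTION: a hand-made judgment of which words should match each other.
# This grouping is not from the thread; a real project would ask the customer.
WANTED_GROUPS = [
    {"organization", "organize"},
    {"organic"},
    {"generous"},
    {"generic"},
    {"dog", "dogs"},
]


def stemming_errors(stems, groups, column):
    """Return (overstemmed_pairs, understemmed_pairs) for one stemmer column."""
    group_of = {word: i for i, group in enumerate(groups) for word in group}
    over, under = [], []
    for a, b in combinations(sorted(stems), 2):
        same_stem = stems[a][column] == stems[b][column]
        same_group = group_of[a] == group_of[b]
        if same_stem and not same_group:
            over.append((a, b))   # conflated, but should be distinct
        elif same_group and not same_stem:
            under.append((a, b))  # kept apart, but should match
    return over, under


if __name__ == "__main__":
    print("porter:", stemming_errors(STEMS, WANTED_GROUPS, PORTER))
    print("kstem: ", stemming_errors(STEMS, WANTED_GROUPS, KSTEM))
```

On this small sample, Porter only overstems and KStem only understems, but 
the result depends entirely on the assumed grouping.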

The original Porter stemmer, for example, would stem "organization" and 
"organic" both to "organ", and "generalization", "generous" and "generic" to 
"gener".

For background on the Porter stemmers and lots of examples see these pages:

http://snowball.tartarus.org/algorithms/porter/stemmer.html

http://snowball.tartarus.org/algorithms/english/stemmer.html

This paper on the Kstem stemmer lists cases where the Porter stemmer 
understems or overstems and explains the logic of Kstem: "Viewing Morphology 
as an Inference Process" (Krovetz, R., Proceedings of the Sixteenth 
Annual International ACM SIGIR Conference on Research and Development in 
Information Retrieval, 191-203, 1993).

http://ciir.cs.umass.edu/pubfiles/ir-35.pdf

Tom

http://www.hathitrust.org/blogs/large-scale-search

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org 

