lucene-java-user mailing list archives

From Michael Sokolov <soko...@falutin.net>
Subject Re: Which stemmer?
Date Fri, 16 Nov 2012 02:00:35 GMT
On 11/15/2012 1:06 PM, Tom Burton-West wrote:
> This paper on the Kstem stemmer lists cases where the Porter stemmer
> understems or overstems and explains the logic of Kstem: "Viewing
> Morphology as an Inference Process" (Krovetz, R., Proceedings of the
> Sixteenth Annual International ACM SIGIR Conference on Research and
> Development in Information Retrieval, 191-203, 1993).
>
> http://ciir.cs.umass.edu/pubfiles/ir-35.pdf
>
Thanks for the reference; that was very enlightening.  The paper 
explains why KStem leaves many terms unstemmed when one might expect 
otherwise.  A word that is found in the dictionary (by which I think 
they mean it has its own sense whose definition does not mention the 
stem word) is not stemmed, since KStem assumes it has its own 
particular meaning and is not derived *purely by inflection*.

The dictionary they used is the Longman dictionary, which is available 
for free online.  I looked up "dog" 
http://www.ldoceonline.com/dictionary/dog_1 and found that there is a 
sense there (sense 13) whose definition reads:


    dogs [plural, American English, informal]: feet

This sense doesn't mention the stem word "dog"; it clearly has a 
different meaning than the main dog entry, so I guess the thinking 
behind this is: someone searching for "dogs" (meaning feet) wouldn't 
want to find text with "dog" (meaning man's best friend).  Of course, 
in this case "dog" singular presumably could mean foot as well, so the 
inference seems faulty, although perhaps that usage never occurs.  
Honestly I've never heard of anyone using "dogs" to mean feet either, 
but hey, nobody's perfect.

This entry: http://www.ldoceonline.com/dictionary/bound_4 probably 
explains the reason "bounds" doesn't stem to "bound".

In the Lucene KStemmer code, this translates into the word appearing in 
one of the dictionary data files.  If a word appears there (as "dogs" 
and "bounds" do), it won't be stemmed.  I suppose a possible approach 
here would be to send the client the list of words that are not 
stemmed and let them remove some, but then you'd have to compile your 
own KStemmer variant.

Perhaps a nice feature to add to KStemmer would be to have it read a 
list of exception words at run-time that would be removed from its 
dictionary in order to allow them to be stemmed.
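
To make the idea concrete, here is a toy Python sketch.  It is not the 
Lucene KStemmer code and does not use any Lucene API: naive_stem, 
DictionaryStemmer and load_exceptions are hypothetical names, the 
plural-stripping rule is a stand-in for real inflection rules, and the 
two-word dictionary is only the pair of examples discussed above.

```python
def naive_stem(word):
    """Toy plural stripper standing in for a real stemmer's inflection rules."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def load_exceptions(lines):
    """Parse an exception list: one word per line, '#' starts a comment."""
    words = set()
    for line in lines:
        word = line.split("#", 1)[0].strip().lower()
        if word:
            words.add(word)
    return words


class DictionaryStemmer:
    """Leaves dictionary words alone unless they were listed as exceptions."""

    def __init__(self, dictionary, exceptions=()):
        # Exception words are removed from the dictionary, so they lose
        # their protection and go through the normal stemming rules.
        self.dictionary = set(dictionary) - set(exceptions)

    def stem(self, word):
        if word in self.dictionary:
            return word  # assumed to carry its own meaning; do not stem
        return naive_stem(word)


# Hypothetical dictionary holding just the two words from this thread.
dictionary = {"dogs", "bounds"}

default = DictionaryStemmer(dictionary)
custom = DictionaryStemmer(
    dictionary, load_exceptions(["dogs  # allow this one to be stemmed", ""])
)
```

With this setup, default.stem("dogs") leaves the word untouched, while 
custom.stem("dogs") reduces it to "dog"; "bounds" stays protected in 
both.  In practice the exception lines would come from a file read at 
start-up rather than from an inline list.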

-Mike
