lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrzej Bialecki ...@getopt.org>
Subject Re: Plural Stemming
Date Sat, 02 Apr 2005 10:10:18 GMT
mark harwood wrote:
> Just ran this method on 4500 words ending in "s" in my
> index and results looks good but I'm tempted to remove
> this line:
> 
>            !word.endsWith("ses") ) 
> 
> With it removed I saw 3 oddities moses=mose gases=gase
> viruses=viruse but I got 100+ extra stems that were
> OK:

Stemming doesn't have to produce intelligible words, it's the 
lemmatization that does. As long as the stem is unique, and all 
inflected forms of a single base form map to the same stem, it's ok.

In the case above the probability of another word producing the same 
stem "mose" is very low, so this stem is ok, too.

-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message