lucene-java-user mailing list archives

From markharw00d <markharw...@yahoo.co.uk>
Subject Re: Plural Stemming
Date Sat, 02 Apr 2005 10:40:35 GMT
>> Stemming doesn't have to produce intelligible words

True, and that should be fine for general search requirements.
However, the code presented does make some attempt to produce
intelligible words, e.g. parties=party, unlike the Porter stemmer's
parties=parti. Does this make it a "lemmatizer"?
This is a feature I find useful in my particular app, a utility that
discovers all the main collocations in an index, e.g. "stag party".
The utility reads indexed terms, so it helps if they are intelligible:
they can then be used as suggested spelling corrections or possible
query refinements. Used in this context as a "lemmatizer" (if this is
the right word), it doesn't seem to be doing a bad job of producing
generally readable words that can be presented back to the end user.
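
To make the difference concrete, here is a rough sketch of the kind of
plural stripping under discussion. It is not the method posted earlier
in the thread: the class name, the stem() helper and the guardSes flag
are all made up for illustration, and the rule set is deliberately
tiny. The flag stands in for the endsWith("ses") check quoted below, so
both behaviours can be compared.

```java
// Illustrative sketch only; names and rules are assumptions, not the
// code from this thread.
class PluralStemSketch {

    // Strips a plural ending, trying to leave a readable word.
    // guardSes = true leaves words ending in "ses" untouched.
    static String stem(String word, boolean guardSes) {
        String w = word.toLowerCase();
        if (w.length() < 4 || !w.endsWith("s")) {
            return w; // too short, or not a plural candidate
        }
        if (w.endsWith("ies") && w.length() > 4) {
            // parties -> party (Porter would give "parti")
            return w.substring(0, w.length() - 3) + "y";
        }
        if (w.endsWith("ss") || w.endsWith("us") || w.endsWith("is")) {
            return w; // class, virus, basis: not plurals
        }
        if (guardSes && w.endsWith("ses")) {
            return w; // the check being debated in this thread
        }
        return w.substring(0, w.length() - 1); // drop the final "s"
    }

    public static void main(String[] args) {
        String[] samples = {"parties", "houses", "moses", "gases", "viruses"};
        for (String s : samples) {
            System.out.println(s + " guarded=" + stem(s, true)
                    + " unguarded=" + stem(s, false));
        }
    }
}
```

With the guard off, this sketch turns houses into house but also gives
moses=mose, gases=gase and viruses=viruse, which matches the trade-off
described in the quoted message below.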

Cheers,
Mark




Andrzej Bialecki wrote:

> mark harwood wrote:
>
>> Just ran this method on 4500 words ending in "s" in my
>> index and results looks good but I'm tempted to remove
>> this line:
>>
>>            !word.endsWith("ses") )
>> With it removed I saw 3 oddities moses=mose gases=gase
>> viruses=viruse but I got 100+ extra stems that were
>> OK:
>
>
> Stemming doesn't have to produce intelligible words, it's the 
> lemmatization that does. As long as the stem is unique, and all 
> inflected forms of a single base form map to the same stem, it's ok.
>
> In the case above the probability of another word producing the same 
> stem "mose" is very low, so this stem is ok, too.
>


