lucene-solr-user mailing list archives

From Jan Høydahl / Cominvent <jan....@cominvent.com>
Subject Re: preside != president
Date Mon, 28 Jun 2010 17:54:22 GMT
Hi,

You might also want to check out the new Lucene-Hunspell stemmer at http://code.google.com/p/lucene-hunspell/
It uses OpenOffice dictionaries with known stems in combination with a large set of language
specific rules.
It handles your example, but it is an early release, so test it thoroughly before deploying
in production :)
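
To illustrate why checking a dictionary of known words before applying suffix rules helps, here is a toy Python sketch. It is not the Hunspell, KStem, or Porter algorithm; the suffix list, the word set, and both function names are made up for illustration.

```python
# Toy illustration only: NOT the Porter, KStem, or Hunspell algorithm.
# Hypothetical suffix rules, tried in this order (longest first).
SUFFIXES = ("ents", "ent", "es", "s", "e")

# Hypothetical stand-in for a real dictionary of known words.
KNOWN_WORDS = {"president", "preside"}


def naive_stem(word):
    """Strip the first matching suffix with no dictionary check."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def guarded_stem(word):
    """Leave dictionary words alone; otherwise strip only a plural 's'."""
    if word in KNOWN_WORDS:
        return word
    if word.endswith("s") and word[:-1] in KNOWN_WORDS:
        return word[:-1]
    return word


for w in ("president", "presidents", "preside"):
    print(w, naive_stem(w), guarded_stem(w))
```

With the naive rules, "president" and "preside" collapse to the same stem, which is the kind of conflation described in the original question. The guarded version keeps them apart while still mapping "presidents" to "president".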

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 28 June 2010, at 17:43, Joe Calderon wrote:

> the general consensus among people who run into the problem you have
> is to use a plurals only stemmer, a synonyms file or a combination of
> both (for irregular nouns etc)
> 
> if you search the archives you can find info on a plurals stemmer
> 
> On Mon, Jun 28, 2010 at 6:49 AM,  <darren@ontrenet.com> wrote:
>> Thanks for the tip. Yeah, I think the stemming confounds search results as
>> it stands (Porter stemmer).
>> 
>> I was also thinking of using my dictionary of 500,000 words with their
>> complete morphologies and conjugations to create a synonyms.txt that
>> provides accurate English morphology.
>> 
>> Is this a good idea?
>> 
>> Darren
>> 
>>> Hi Darren,
>>> 
>>> You might want to look at the KStemmer
>>> (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem)
>>> instead of the standard PorterStemmer. It essentially has a 'dictionary'
>>> of exception words where stemming stops if found, so in your case
>>> president won't be stemmed any further than president (but presidents will
>>> be stemmed to president). You will have to integrate it into Solr
>>> yourself, but that's straightforward.
>>> 
>>> HTH
>>> Brendan
>>> 
>>> 
>>> On Jun 28, 2010, at 8:04 AM, Darren Govoni wrote:
>>> 
>>>> Hi,
>>>>  It seems to me that because the stemming does not produce
>>>> grammatically correct stems in many of the cases,
>>>> search anomalies can occur like the one I am seeing where I have a
>>>> document with "president" in it and it is returned
>>>> when I search for "preside", a different word entirely.
>>>> 
>>>> Is this correct or acceptable behavior? In previous discussions here on
>>>> stemming, I was told it's OK as long as all the words reduce
>>>> to the same stem, but when different words reduce to the same stem it
>>>> seems to affect search results in a "bad way".
>>>> 
>>>> Darren
>>> 
>>> 
>> 
>> 
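
On the synonyms.txt idea quoted above, a minimal Python sketch of generating the file from a morphology dictionary might look like the following. The `MORPHOLOGY` data and the `synonym_lines` helper are hypothetical; the only thing assumed about Solr is that its synonym filter accepts lines of comma-separated equivalent terms.

```python
# Hypothetical morphology data: lemma -> inflected forms.
# A real dictionary would be loaded from a file instead.
MORPHOLOGY = {
    "president": ["presidents"],
    "preside": ["presides", "presided", "presiding"],
}


def synonym_lines(morphology):
    """Build one comma-separated line per lemma, lemma first."""
    lines = []
    for lemma in sorted(morphology):
        # Skip any form identical to the lemma to avoid duplicates.
        forms = [lemma] + [f for f in morphology[lemma] if f != lemma]
        lines.append(", ".join(forms))
    return lines


# Print the lines; redirect to synonyms.txt to produce the file.
for line in synonym_lines(MORPHOLOGY):
    print(line)
```

Because each lemma gets its own line, "preside" and "president" stay in separate groups, so a search for one does not match the other. The stemmer would need to be removed or restricted for this to have the intended effect.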

