Subject: Re: preside != president
From: Jan Høydahl / Cominvent
Date: Mon, 28 Jun 2010 19:54:22 +0200
To: solr-user@lucene.apache.org

Hi,

You might also want to check out the new Lucene-Hunspell stemmer at
http://code.google.com/p/lucene-hunspell/
It uses OpenOffice dictionaries with known stems in combination with a
large set of language specific rules.
It handles your example, but it is an early release, so test it
thoroughly before deploying in production :)

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 28. juni 2010, at 17.43, Joe Calderon wrote:

> the general consensus among people who run into the problem you have
> is to use a plurals only stemmer, a synonyms file or a combination of
> both (for irregular nouns etc)
>
> if you search the archives you can find info on a plurals stemmer
>
> On Mon, Jun 28, 2010 at 6:49 AM, wrote:
>> Thanks for the tip. Yeah, I think the stemming confounds search results as
>> it stands (porter stemmer).
>>
>> I was also thinking of using my dictionary of 500,000 words with their
>> complete morphologies and conjugations and create a synonyms.txt to
>> provide english accurate morphology.
>>
>> Is this a good idea?
>>
>> Darren
>>
>>> Hi Darren,
>>>
>>> You might want to look at the KStemmer
>>> (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem)
>>> instead of the standard PorterStemmer. It essentially has a 'dictionary'
>>> of exception words where stemming stops if found, so in your case
>>> president won't be stemmed any further than president (but presidents will
>>> be stemmed to president). You will have to integrate it into solr
>>> yourself, but that's straightforward.
>>>
>>> HTH
>>> Brendan
>>>
>>>
>>> On Jun 28, 2010, at 8:04 AM, Darren Govoni wrote:
>>>
>>>> Hi,
>>>> It seems to me that because the stemming does not produce
>>>> grammatically correct stems in many of the cases,
>>>> search anomalies can occur like the one I am seeing where I have a
>>>> document with "president" in it and it is returned
>>>> when I search for "preside", a different word entirely.
>>>>
>>>> Is this correct or acceptable behavior? Previous discussions here on
>>>> stemming, I was told its ok as long as all the words reduce
>>>> to the same stem, but when different words reduce to the same stem it
>>>> seems to affect search results in a "bad way".
>>>>
>>>> Darren
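
For anyone reading this thread in the archive: the "dictionary of exception
words where stemming stops if found" idea described above can be sketched
roughly as follows. This is only a toy illustration, not the Porter or KStem
algorithm and not Solr code. The suffix list, the KNOWN_WORDS set, and the
function names naive_stem and guarded_stem are all made up for the example.

```python
# Toy suffix rules for illustration only. This is NOT the Porter or KStem
# algorithm; it just reproduces the kind of collision discussed in the thread.
SUFFIXES = ("ents", "ent", "es", "s", "e")

# Hypothetical dictionary of words that must not be stemmed any further.
KNOWN_WORDS = {"president", "preside"}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least four characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def guarded_stem(word, known_words=KNOWN_WORDS):
    """Stop stemming as soon as a dictionary word is reached."""
    if word in known_words:
        return word
    # Try removing only a plural ending and check the dictionary again.
    for suffix in ("es", "s"):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in known_words:
                return candidate
    # Unknown word: fall back to the aggressive rules.
    return naive_stem(word)


if __name__ == "__main__":
    for w in ("president", "presidents", "preside", "presides"):
        print(w, naive_stem(w), guarded_stem(w))
```

With the toy rules, "president" and "preside" collapse to the same stem, which
is the anomaly Darren reported. With the dictionary check they stay distinct,
while "presidents" still reduces to "president".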