From: "Terry Steichen"
To: "Lucene Users Group"
Subject: Stemmer Benefits/Costs
Date: Thu, 22 Apr 2004 17:14:42 -0400

I've been experimenting with the Porter and Snowball stemmers.
It seems to me that one of the most valuable benefits these provide is the capability to generalize phrase terms. As a very simple example, without the stemmer, I might need to include three phrase terms in my query: "north korea", "north korean", "north koreans". But with the stemmer only one will suffice. To me, that's a huge advantage. (For non-phrases, the advantage doesn't seem to be so great, because much the same effect can be achieved with wildcards.)

But there seems to be a price that you also pay, in that discrimination may be adversely affected. If you want to discriminate between two terms that the stemmer views as derived from the same root, you're out of luck (I think). The problem with this is that you may start with a set of terms that don't have this problem, but over time as new content is added to the index, such problems may gradually get introduced - often unpredictably. And to the best of my (admittedly limited) knowledge, once you've indexed using a stemmer, there's no way to override it in specific instances.

Appreciate any comments, thoughts on the above.

Regards,

Terry
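
The trade-off described above can be illustrated with a small sketch. It is only a toy: `toy_stem` is a hypothetical two-rule suffix stripper invented for this example, not the Porter or Snowball algorithm, and `analyze` and `phrase_match` are likewise made-up helpers rather than Lucene APIs. Real stemmers will conflate different word sets, so treat the specific words as assumptions.

```python
def toy_stem(token: str) -> str:
    """Hypothetical suffix stripper for illustration only (NOT Porter/Snowball)."""
    t = token.lower()
    # Rule 1: drop a plural-looking trailing "s" ("koreans" -> "korean", "news" -> "new").
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        t = t[:-1]
    # Rule 2: drop the "n" of a trailing "an" ("korean" -> "korea").
    if len(t) > 4 and t.endswith("an"):
        t = t[:-1]
    return t


def analyze(text: str, stem: bool) -> list[str]:
    """Lowercase, split on whitespace, and optionally stem each token."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens] if stem else tokens


def phrase_match(query: str, doc: str, stem: bool) -> bool:
    """True if the analyzed query tokens occur contiguously in the analyzed doc."""
    q = analyze(query, stem)
    d = analyze(doc, stem)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


docs = [
    "talks with north korea resumed",
    "a north korean delegation arrived",
    "north koreans voted",
]

# Benefit: one stemmed phrase covers all three surface forms.
unstemmed_hits = [phrase_match("north korea", d, stem=False) for d in docs]
stemmed_hits = [phrase_match("north korea", d, stem=True) for d in docs]

# Cost: "news" and "new" collapse to the same term, so they can no longer
# be told apart once both the index and the query are stemmed.
false_positive = phrase_match("news", "a new policy", stem=True)
```

With this toy stemmer, `unstemmed_hits` is true only for the first document while `stemmed_hits` is true for all three, and `false_positive` shows the lost discrimination: a query for "news" matches a document that only contains "new".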