From: Andrzej Bialecki
To: Lucene Users List
Subject: Re: Stemmer Benefits/Costs
Date: Thu, 22 Apr 2004 23:37:42 +0200
Message-ID: <40883B26.7030707@getopt.org>
In-Reply-To: <05c201c428ae$d496a700$6501a8c0@POWERPACK>

Terry Steichen wrote:
> I've been experimenting with the Porter and Snowball stemmers. It
> seems to me that one of the most valuable benefits these provide is
> the capability to generalize phrase terms.
> As a very simple example,
> without the stemmer, I might need to include three phrase terms in my
> query: "north korea", "north korean", "north koreans". But with the
> stemmer only one will suffice. To me, that's a huge advantage. (For
> non-phrases, the advantage doesn't seem to be so great, because much
> the same effect can be achieved with wildcards.)

That's because you look at it from the perspective of the English
language, with its minimal inflection... My mother tongue is Polish, a
highly inflectional language from the Slavic family. It is normal for a
single Polish word to have 20+ different inflected forms
(plural/singular/dual, tense, gender, mood, case, infinitive... enough?
;-) ). For this type of language, studies show that stemming (or rather
lemmatization: bringing words to their base grammatical forms)
significantly improves recall in IR systems.

>
> But there seems to be a price that you also pay, in that
> discrimination may be adversely affected. If you want to
> discriminate between two terms that the stemmer views as derived from
> the same root, you're out of luck (I think). The problem with this

Stemming usually improves recall, but lowers precision. For some systems
it is more desirable to provide any results, even if they are not quite
correct, than to provide none.

> is that you may start with a set of terms that don't have this
> problem, but over time as new content is added to the index, such
> problems may gradually get introduced - often unpredictably. And to
> the best of my (admittedly limited) knowledge, once you've indexed
> using a stemmer, there's no way to override it in specific instances.

You can always store stemmed and non-stemmed terms alongside each other
in your index.

>
> Appreciate any comments, thoughts on the above.
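
To make the "stemmed and non-stemmed terms alongside" idea concrete,
here is a minimal sketch in plain Python. It is not Lucene code: the
class DualFieldIndex, the field names "exact" and "stemmed", and the
tiny lookup-table stemmer toy_stem are all hypothetical, made up for
illustration, and the search is a linear scan rather than a real
inverted index. In Lucene the rough equivalent would presumably be to
index the same text into two fields, one analyzed with a stemmer and
one without.

```python
# Toy stand-in for a real stemmer (assumption: a lookup table covering
# only the forms used in this example; a real system would use Porter,
# Snowball, or a trained lemmatizer).
TOY_STEMS = {"korean": "korea", "koreans": "korea"}


def toy_stem(token):
    """Return the base form if known, else the token unchanged."""
    return TOY_STEMS.get(token, token)


class DualFieldIndex:
    """Keeps an exact and a stemmed token stream for every document."""

    def __init__(self, stem):
        self.stem = stem
        self.docs = {}  # doc_id -> {"exact": [...], "stemmed": [...]}

    def add(self, doc_id, text):
        tokens = text.lower().split()
        self.docs[doc_id] = {
            "exact": tokens,
            "stemmed": [self.stem(t) for t in tokens],
        }

    def search_phrase(self, phrase, field):
        """Return sorted ids of documents containing the phrase in `field`."""
        terms = phrase.lower().split()
        if field == "stemmed":
            # Query terms must go through the same stemmer as the index.
            terms = [self.stem(t) for t in terms]
        n = len(terms)
        hits = []
        for doc_id, fields in self.docs.items():
            tokens = fields[field]
            # Linear scan for the terms as a contiguous run of tokens.
            if any(tokens[i:i + n] == terms
                   for i in range(len(tokens) - n + 1)):
                hits.append(doc_id)
        return sorted(hits)


index = DualFieldIndex(toy_stem)
index.add(1, "North Korea announced talks")
index.add(2, "North Korean officials arrived")
index.add(3, "North Koreans voted")

broad = index.search_phrase("north korean", "stemmed")   # recall
narrow = index.search_phrase("north korean", "exact")    # precision
```

Searching the "stemmed" field gives the generalized phrase match,
while the "exact" field is still there when you need to discriminate
between forms that the stemmer conflates. The price is a larger index.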
For highly-inflectional languages I had _very_ good results with
stemmers built using the code from Egothor project
(http://www.egothor.org) - much more sophisticated than simple
rule-based stemmers like Snowball or Porter. In fact, after proper
training on a large corpus I was getting ~70% of correct lemmas for
previously unseen words, and over 90% of correct (unique) stems.

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)