From: Andrzej Bialecki
Date: Fri, 23 Apr 2004 10:35:17 +0200
To: Lucene Users List
Subject: Re: Stemmer Benefits/Costs

Terry Steichen wrote:
> So, Andrez - Thank you for your comments - what you say makes a good deal of
> sense.
> When you have lots of different inflections that all share the same
> root, stemming can clearly provide significant (recall) benefits (in terms
> of catching hidden words and/or simplifying the query).
>
> However, would you say that "from the perspective of English" ("with its
> minimal inflection") the points I raise are correct? (You seem to say so
> with the statement that stemming "usually improves recall, but lowers
> precision.")
>
> And, would you expect significant benefits from the Egothor project code
> (versus Snowball/Porter) when the text is in English (as opposed to a highly
> inflectional language like Polish)?

I did only minimal testing with English (and somewhat more extensive
testing with Scandinavian languages). The results were also promising, if
somewhat unexpected.

The unexpected part comes from the realization that stemming doesn't
have to produce any real "root", as long as it provides you with a unique
key for all inflected forms derived from the same base form (lemma). So,
from this point of view it's perfectly OK if you get "blurfl" as the stem
of "give", as long as you get the same "blurfl" for "gave, given, gives,
giving", and for nothing else. Think of it like a hashCode() ...

Egothor's stemmer package is not as abstract as in this example -
usually (> 90% for English? ~70% for Polish) the stems that it produces
correspond to real base forms (lemmas). Its algorithm is based on state
machines with memory, stored in a trie, in the form of patch commands.
This allows it to handle not only suffix-based inflection but also
prefix- and infix-based inflection. Stemming tables are learned from
training corpora, which consist of base and inflected forms. The
resulting state machine binary weighs around 300kB for Polish. For
English it's smaller - somewhere around 100kB.
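
To make the hashCode() analogy concrete, here is a minimal Python sketch.
It is not Egothor code and not the Lucene API: the stem table, the key
"k17" and the sample documents are all made up for illustration, and only
"blurfl" and the forms of "give" come from the example above. It shows
that an index keyed by opaque stems matches every inflected form even
though no key is a real word.

```python
from collections import defaultdict

# Hypothetical stem table: each inflected form maps to an opaque key.
# "blurfl" is the made-up stem from the text; "k17" is equally arbitrary.
STEM_TABLE = {
    "give": "blurfl", "gave": "blurfl", "given": "blurfl",
    "gives": "blurfl", "giving": "blurfl",
    "take": "k17", "took": "k17", "taken": "k17",
    "takes": "k17", "taking": "k17",
}


def stem(word):
    """Return the opaque key for a word.

    Unknown words fall back to their lowercased form, which is a
    simplification made for this sketch.
    """
    word = word.lower()
    return STEM_TABLE.get(word, word)


def build_index(docs):
    """Build an inverted index keyed by stem: stem -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query_word):
    """Stem the query the same way as the documents, then look it up."""
    return sorted(index.get(stem(query_word), set()))


# Made-up sample documents.
DOCS = {
    1: "she gave a talk",
    2: "giving is better",
    3: "they took notes",
}

index = build_index(DOCS)
print(search(index, "give"))
print(search(index, "taken"))
```

The only requirement on stem() is consistency: the same function must be
applied at indexing time and at query time, and all forms of one lemma
must land on the same key.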

In my experiments the state machine for English was able to find correct
lemmas more often than other types of stemmers (sorry for such a poorly
qualified statement - as I said, I didn't do systematic testing for
English). Over- and under-stemming didn't occur as often. So, my cautious
advice would be to give it a try. :-)

Now, if you again use the analogy of a hash code, inevitably some
collisions will occur during stemming, i.e. some inflected forms which
correspond to different lemmas (roots) will be brought to the same stem.
This is where you lose precision. Also, in some cases stemmers will
produce two stems from a group of words having the same lemma. This is
where you lose recall (because now the "stem" covers only a subset of all
possible inflections).

This leads to an interesting conclusion: by blindly using stemmers (which
may not be suitable for your corpus, or the document's language), in one
sweep you can nicely lower BOTH the precision and recall measures!
Clearly, not what one would expect or desire ... :-)

To summarize: IMHO indiscriminate use of stemming, soundexing and other
language-specific techniques is in general more likely to reduce the
quality of your results. However, when applied correctly, for a
well-known corpus and use cases, it can bring a significant increase in
recall, with only a minimal penalty in precision.

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
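
P.S. The collision and split cases described above can be checked with a
few lines of Python. This is only a sketch under stated assumptions: the
stem and lemma tables below are invented (they are not the output of
Porter, Snowball or Egothor), and precision and recall are computed over
a tiny vocabulary of word forms rather than over documents. A word counts
as "retrieved" if it shares the query's stem and as "relevant" if it
shares the query's lemma.

```python
# Hypothetical output of a poorly suited stemmer.
# "gave" is split off from the other forms of "give" (hurts recall),
# and it collides with "gavel"/"gavels" (hurts precision).
STEM_OF = {
    "give": "giv", "gives": "giv", "giving": "giv", "given": "giv",
    "gave": "gav",
    "gavel": "gav", "gavels": "gav",
}

# Ground truth: the real base form (lemma) of each word.
LEMMA_OF = {
    "give": "give", "gives": "give", "giving": "give", "given": "give",
    "gave": "give",
    "gavel": "gavel", "gavels": "gavel",
}


def precision_recall(query, stem_of, lemma_of):
    """Return (precision, recall) for one query word.

    retrieved: all words that share the query's stem
    relevant:  all words that share the query's lemma
    """
    retrieved = {w for w in stem_of if stem_of[w] == stem_of[query]}
    relevant = {w for w in lemma_of if lemma_of[w] == lemma_of[query]}
    hits = retrieved & relevant
    return len(hits) / len(retrieved), len(hits) / len(relevant)


for word in ("give", "gavel", "gave"):
    p, r = precision_recall(word, STEM_OF, LEMMA_OF)
    print(f"{word}: precision={p:.2f} recall={r:.2f}")

# A stemmer that behaves like a perfect hash (one key per lemma and
# nothing else) scores 1.0 on both measures for every query.
for word in LEMMA_OF:
    assert precision_recall(word, LEMMA_OF, LEMMA_OF) == (1.0, 1.0)
```

In this toy table the query "give" only loses recall and "gavel" only
loses precision, but "gave" loses both at once, which is the "one sweep"
effect.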