lucene-java-user mailing list archives

From "Terry Steichen" <te...@net-frame.com>
Subject Re: Stemmer Benefits/Costs
Date Thu, 22 Apr 2004 23:58:18 GMT
Andrzej,

Sorry for misspelling your name.  My Polish sucks.

Terry

----- Original Message ----- 
From: "Terry Steichen" <terry@net-frame.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Thursday, April 22, 2004 7:56 PM
Subject: Re: Stemmer Benefits/Costs


> So, Andrez - Thank you for your comments - what you say makes a good deal of
> sense.  When you have lots of different inflections that all share the same
> root, stemming can clearly provide significant (recall) benefits (in terms
> of catching hidden words and/or simplifying the query).
>
> However, would you say that "from the perspective of English" ("with its
> minimal inflection") the points I raise are correct?  (You seem to say so
> with the statement that stemming "usually improves recall, but lowers
> precision.")
>
> And, would you expect significant benefits from the Egothor project code
> (versus Snowball/Porter) when the text is in English (as opposed to a highly
> inflectional language like Polish)?
>
> Regards,
>
> Terry
>
> ----- Original Message ----- 
> From: "Andrzej Bialecki" <ab@getopt.org>
> To: "Lucene Users List" <lucene-user@jakarta.apache.org>
> Sent: Thursday, April 22, 2004 5:37 PM
> Subject: Re: Stemmer Benefits/Costs
>
>
> > Terry Steichen wrote:
> >
> > > I've been experimenting with the Porter and Snowball stemmers.  It
> > > seems to me that one of the most valuable benefits these provide is
> > > the capability to generalize phrase terms.  As a very simple example,
> > > without the stemmer, I might need to include three phrase terms in my
> > > query: "north korea", "north korean", "north koreans".  But with the
> > > stemmer only one will suffice.  To me, that's a huge advantage.  (For
> > > non-phrases, the advantage doesn't seem to be so great, because much
> > > the same effect can be achieved with wildcards.)
> >
> > That's because you look at it from the perspective of the English language
> > with its minimal inflection... My mother tongue is Polish - a highly
> > inflectional language from the Slavic family of languages. It is normal
> > for a single Polish word to have as many as 20+ different inflected
> > forms (plural/singular/dual, tense, gender, mood, case, infinitive...
> > enough? ;-) ). For this type of language studies show that stemming (or
> > rather lemmatization - bringing words to their base grammatical forms)
> > significantly improves recall in IR systems.
> >
> > >
> > > But there seems to be a price that you also pay, in that
> > > discrimination may be adversely affected.  If you want to
> > > discriminate between two terms that the stemmer views as derived from
> > > the same root, you're out of luck (I think).  The problem with this
> >
> > Stemming usually improves recall, but lowers precision. For some systems
> > it is more desirable to provide any results, even if they are not quite
> > correct, than to provide none.
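
A toy illustration of the recall/precision trade-off described above. It is plain Python rather than Lucene code, and everything in it is an assumption made for the example: the stem table is hand-written (it is not the output of Porter, Snowball, or Egothor), and the documents and relevance judgments are invented.

```python
# Hypothetical stem table, hand-written for illustration only.
STEMS = {"organs": "organ", "organization": "organ"}


def stem(token):
    """Map a token to its (hypothetical) stem; unknown tokens pass through."""
    return STEMS.get(token, token)


def search(docs, query, use_stemmer):
    """Return ids of documents containing the query term, optionally stemmed."""
    norm = stem if use_stemmer else (lambda t: t)
    wanted = norm(query)
    return {
        doc_id
        for doc_id, text in docs.items()
        if wanted in {norm(t) for t in text.split()}
    }


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented corpus: the user wants documents about anatomical organs.
DOCS = {
    1: "organ donation rules",
    2: "donated organs registry",
    3: "charity organization report",
}
RELEVANT = {1, 2}

exact = search(DOCS, "organs", use_stemmer=False)
stemmed = search(DOCS, "organs", use_stemmer=True)

print("exact:  ", sorted(exact), precision_recall(exact, RELEVANT))
print("stemmed:", sorted(stemmed), precision_recall(stemmed, RELEVANT))
```

With these made-up inputs, exact matching returns only document 2 (precision 1.0, recall 0.5), while stemmed matching returns all three documents (precision 2/3, recall 1.0): the stemmer finds the missed document but also pulls in the unrelated one.
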
> >
> > > is that you may start with a set of terms that don't have this
> > > problem, but over time as new content is added to the index, such
> > > problems may gradually get introduced - often unpredictably.  And to
> > > the best of my (admittedly limited) knowledge, once you've indexed
> > > using a stemmer, there's no way to override it in specific instances.
> >
> > You can always store both stemmed and non-stemmed terms alongside each
> > other in your index.
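
One way to picture the suggestion above is a pair of inverted indexes over the same documents, one keyed by the original tokens and one keyed by their stems, so a query can choose which to consult. The sketch below is plain Python, not the Lucene API; the field names `exact` and `stemmed`, the stem table, and the documents are all hypothetical.

```python
from collections import defaultdict

# Hypothetical stem table, hand-written for illustration only.
STEMS = {"korean": "korea", "koreans": "korea"}


def build_index(docs):
    """Index every token twice: as written ("exact") and stemmed ("stemmed")."""
    index = {"exact": defaultdict(set), "stemmed": defaultdict(set)}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index["exact"][token].add(doc_id)
            index["stemmed"][STEMS.get(token, token)].add(doc_id)
    return index


def lookup(index, term, field):
    """Look a term up in the chosen field, stemming it only for "stemmed"."""
    term = term.lower()
    if field == "stemmed":
        term = STEMS.get(term, term)
    return set(index[field].get(term, set()))


DOCS = {
    1: "North Korea talks",
    2: "North Korean delegation",
    3: "North Koreans vote",
}
INDEX = build_index(DOCS)

print(sorted(lookup(INDEX, "koreans", "stemmed")))  # broad match
print(sorted(lookup(INDEX, "koreans", "exact")))    # discriminating match
```

Here the stemmed lookup for "koreans" matches documents 1, 2, and 3, while the exact lookup matches only document 3, so the choice between recall and discrimination can be made per query instead of once at indexing time. The cost is a larger index, since every token is stored in both forms.
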
> >
> > >
> > > Appreciate any comments, thoughts on the above.
> >
> > For highly-inflectional languages I had _very_ good results with
> > stemmers built using the code from Egothor project
> > (http://www.egothor.org) - much more sophisticated than simple
> > rule-based stemmers like Snowball or Porter. In fact, after proper
> > training on a large corpus I was getting ~70% of correct lemmas for
> > previously unseen words, and over 90% of correct (unique) stems.
> >
> > -- 
> > Best regards,
> > Andrzej Bialecki
> >
> > -------------------------------------------------
> > Software Architect, System Integration Specialist
> > CEN/ISSS EC Workshop, ECIMF project chair
> > EU FP6 E-Commerce Expert/Evaluator
> > -------------------------------------------------
> > FreeBSD developer (http://www.freebsd.org)
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
>
>



