lucene-java-user mailing list archives

From "Terry Steichen" <>
Subject Stemmer Benefits/Costs
Date Thu, 22 Apr 2004 21:14:42 GMT
I've been experimenting with the Porter and Snowball stemmers.  It seems to me that one of
the most valuable benefits they provide is the ability to generalize phrase terms.  As
a very simple example, without a stemmer I might need to include three phrase terms in
my query: "north korea", "north korean", "north koreans".  With a stemmer, a single phrase
will suffice.  To me, that's a huge advantage.  (For non-phrase terms, the advantage doesn't
seem to be as great, because much the same effect can be achieved with wildcards.)
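
To make the phrase case concrete, here is a minimal sketch in plain Java.  It does not use
Lucene, and the suffix rules are a hypothetical toy, not the Porter or Snowball algorithm;
whether a real stemmer conflates exactly these three forms depends on its own rule set.
The class and method names are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class ToyPhraseStemDemo {
    // Hypothetical suffix rules, tried in order: {suffix, replacement}.
    // These are illustrative only and are NOT the Porter or Snowball rules.
    private static final String[][] RULES = {
        {"ans", "a"},
        {"an", "a"},
        {"s", ""},
    };

    // Apply the first rule that matches and leaves a stem of at least 3 characters.
    static String stem(String term) {
        for (String[] rule : RULES) {
            String suffix = rule[0];
            if (term.endsWith(suffix)) {
                String candidate =
                    term.substring(0, term.length() - suffix.length()) + rule[1];
                if (candidate.length() >= 3) {
                    return candidate;
                }
            }
        }
        return term;
    }

    // Lowercase the phrase, split on whitespace, and stem each term.
    static String stemPhrase(String phrase) {
        StringBuilder out = new StringBuilder();
        for (String term : phrase.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(stem(term));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> normalized = new LinkedHashSet<>();
        for (String p : new String[] {"north korea", "north korean", "north koreans"}) {
            normalized.add(stemPhrase(p));
        }
        // All three variants collapse to one normalized phrase.
        System.out.println(normalized); // prints [north korea]
    }
}
```

Because the same normalization would run at both index time and query time, one phrase
query covers all three surface forms under these toy rules.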

But there also seems to be a price to pay: the ability to discriminate between terms may
suffer.  If you want to distinguish two terms that the stemmer treats as derived from the
same root, you're out of luck (I think).  The trouble is that you may start with
a set of terms that don't have this problem, but as new content is added to the
index over time, such collisions may gradually creep in - often unpredictably.  And to the
best of my (admittedly limited) knowledge, once you've indexed using a stemmer, there's no
way to override it in specific instances.
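
The collision can be sketched with a tiny in-memory inverted index.  This is plain Java,
not Lucene; the one-rule stemmer (strip a trailing "s") is a hypothetical stand-in, and the
documents and names are made up.  The unstemmed index is built only for comparison; keeping
such a parallel exact index is an assumption about one possible way to retain
discrimination, not something the stemmers themselves provide.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class ToyStemCollisionDemo {
    // Hypothetical one-rule stemmer: strip a trailing "s" from terms longer than 3 chars.
    static String stem(String term) {
        return (term.endsWith("s") && term.length() > 3)
            ? term.substring(0, term.length() - 1)
            : term;
    }

    // Build term -> document ids, optionally stemming each token first.
    static Map<String, Set<Integer>> buildIndex(String[] docs, boolean useStemmer) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.length; id++) {
            for (String token : docs[id].toLowerCase(Locale.ROOT).split("\\s+")) {
                String key = useStemmer ? stem(token) : token;
                index.computeIfAbsent(key, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Look up a single term, normalizing it the same way the index was built.
    static Set<Integer> search(Map<String, Set<Integer>> index, String term,
                               boolean useStemmer) {
        String lowered = term.toLowerCase(Locale.ROOT);
        String key = useStemmer ? stem(lowered) : lowered;
        return index.getOrDefault(key, Collections.emptySet());
    }

    public static void main(String[] args) {
        String[] docs = {"new product launch", "evening news broadcast"};
        Map<String, Set<Integer>> stemmed = buildIndex(docs, true);
        Map<String, Set<Integer>> exact = buildIndex(docs, false);

        // "news" and "new" share the stem "new", so the stemmed index returns both docs.
        System.out.println(search(stemmed, "news", true));  // prints [0, 1]
        // The unstemmed index still tells the two terms apart.
        System.out.println(search(exact, "news", false));   // prints [1]
    }
}
```

Note that document 0 could have been added long after document 1 was indexed, which is the
"gradually introduced" collision described above.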

I'd appreciate any comments or thoughts on the above.


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message