Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Message-ID: <05c201c428ae$d496a700$6501a8c0@POWERPACK>
From: "Terry Steichen" <terry@net-frame.com>
To: "Lucene Users Group" <lucene-user@jakarta.apache.org>
Subject: Stemmer Benefits/Costs
Date: Thu, 22 Apr 2004 17:14:42 -0400
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_NextPart_000_05BF_01C4288D.4D3BA1F0"

------=_NextPart_000_05BF_01C4288D.4D3BA1F0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

I've been experimenting with the Porter and Snowball stemmers.  It seems
to me that one of the most valuable benefits they provide is the ability
to generalize phrase terms.  As a very simple example, without a stemmer
I might need to include three phrase terms in my query: "north korea",
"north korean", "north koreans".  With a stemmer, only one will suffice.
To me, that's a huge advantage.  (For non-phrase terms the advantage
doesn't seem as great, because much the same effect can be achieved
with wildcards.)

But there also seems to be a price to pay, in that discrimination may
be adversely affected.  If you want to distinguish between two terms
that the stemmer treats as derived from the same root, you're out of
luck (I think).  The trouble is that you may start with a set of terms
that don't have this problem, but as new content is added to the index
over time, such collisions may gradually creep in, often unpredictably.
And to the best of my (admittedly limited) knowledge, once you've
indexed using a stemmer, there's no way to override it in specific
instances.
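
To make the trade-off concrete, here is a minimal sketch of both
effects.  It assumes a toy suffix stripper: toy_stem, stem_phrase and
the suffix list are made up for illustration and are NOT the Porter or
Snowball algorithm, whose own rules decide which words actually get
conflated.

```python
# Toy suffix stripper for illustration only; NOT Porter or Snowball.
# The suffix list below is a made-up assumption.
TOY_SUFFIXES = ("ns", "n", "s")


def toy_stem(word):
    """Strip the first matching toy suffix, keeping at least 3 letters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_phrase(phrase):
    """Lowercase, split on whitespace, and stem each term."""
    return tuple(toy_stem(term) for term in phrase.lower().split())


if __name__ == "__main__":
    # Benefit: the three phrase variants collapse to one stemmed phrase.
    variants = ["north korea", "north korean", "north koreans"]
    print({stem_phrase(p) for p in variants})  # {('north', 'korea')}

    # Cost: two unrelated words end up with the same stem.
    print(toy_stem("new"), toy_stem("news"))  # new new
```

Since only the stem reaches the index in this sketch, a query has no
way to tell "new" from "news" afterwards.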

I'd appreciate any comments or thoughts on the above.

Regards,

Terry
 
------=_NextPart_000_05BF_01C4288D.4D3BA1F0--