lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jong Kim" <j...@sitescape.com>
Subject RE: Looking for a stemmer that can return all inflected forms
Date Sun, 15 Oct 2006 22:18:45 GMT
All: Thanks for the ideas and suggestions. 

Bill: As Otis pointed out, Lucene already comes with a couple
of stemmers (I'm using Lucene 2.0). Besides PorterStemFilter,
you can also take a look at SnowballAnalyzer and SnowballFilter
classes which support more than just English. The integration
is pretty straightforward.

/Jong

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
Sent: Sunday, October 15, 2006 12:38 AM
To: java-user@lucene.apache.org
Subject: Re: Looking for a stemmer that can return all inflected forms

Bill: Lucene already comes with PorterStemFilter (class name), which you can
use for English.

Ideas 1 and 2 sound interesting, but I think they may end up offering false
positives.  The reason is obvious - multiple and unrelated words can get
stemmed to the same stem.
Is "care" really the stem for "caring"?  Maybe.  But imagine the stem is
"car".  Suddenly the word "cars" shares the same "car" stem and you have a
false positive.

Jong: I _think_ what you need is a "reverse lemmatizer".

Otis

----- Original Message ----
From: Bill Taylor <wataylor@as-st.com>
To: java-user@lucene.apache.org
Cc: Jong Kim <jkim@sitescape.com>
Sent: Saturday, October 14, 2006 11:43:10 PM
Subject: Re: Looking for a stemmer that can return all inflected forms

On Oct 14, 2006, at 3:57 PM, Jong Kim wrote:

> Hi,
>
> I'm looking for a stemmer that is capable of returning all 
> morphological variants  of a query term (to be used for high-recall 
> search). For example, given a query term of 'cares', I would like to 
> be able to generate 'cares', 'care', 'cared', and 'caring'.
>
> I looked at the Porter stemmer, Snowball stemmer, and the K-stem.
> All of them provide a method that takes a surface string ('cares') as 
> an input and returns its base form/stem, which is 'care' in this 
> example.

First of all, I would GREATLY appreciate it if you would tell me which of
these is easiest to incorporate into Lucene.  I have the same problem you
do.  I have solved the other end of it but do not knot how to fit a stemmer
into Lucene.

> But it appears that I can not use the stemmer to generate all of the 
> inflected forms of a given query term.
>
> Does anyone know of such tool for Lucene?

I am writing one which is VERY SPECIAL PURPOSE and therefore my code not
likely to be of much use to you.  HOWEVER, the basic idea is quite
simple:

Idea 1:

1) Since you have to use the stemmer against something, you are reading
words out of the index and extracting their stems.

2) Having done that for a word, find all "nearby" words which have the same
stem.  The simplest definition of "nearby" that I can think of is that the
word starts with the stem, but you might want to drop the last character of
the stem and look for all words that start with that.  
Thus, if the stem is "care" you would look at all words that start with
"car" and if they have "care" as the stem, they are in the same family.

The advantage of this approach is that you do not ever offer any words that
are not in your index.  If you found cares and cared but not caring in your
index, you would not want to suggest that someone search for caring because
they won't find it.  So you use the index as the source of words to stem.

Idea 2:

Another way to do it is to build a hash map of tree sets keyed to the stem.
Each stem has a tree set of all words which have it as a stem.  
The code would look something like

HashMap<String, TreeSet> stemmedWords = new HashMap<String, TreeSet>();
TreeSet<String> wordsForStem;

for (String word : all words in the index)  {
    stem = MagicStemmer(word);  // I left out code for words that do not
have stems
           if ( (wordsForStem = stemmedWords.get(stem)) == null) {
               wordsForStem = new TreeSet<String>();  // Tree set for the
new stem
               stemmedWords.put(stem, wordsForStem); // Now this stem has a
set for its words
          }
         wordsForStem.add(word); // Put the word into the tree set for its
stem }

For each stem from all the words in your index, you get a tree set which
contains all the words which have it as a stem;  The tree set keeps its
words in alphabetical order.

If you want the stems to be displayed in alphabetical order, use a TreeMap
instead of a HashMap.

> Any help or pointer would be greatly appreciated.

I would appreciate your telling me which stemmer for English words is
easiest to incorporate into Lucene and where to find it.  Thanks.

Bill Taylor


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message