lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Donna L Gresh <gr...@us.ibm.com>
Subject sanity check on how stemming, stopwords, and snowball analyzer works together
Date Mon, 15 Oct 2007 14:19:51 GMT
Could those "in the know" comment on my current understanding of stemming 
and stopwords using the snowball analyzer?

In my application, I am using the MoreLikeThis class to find similar 
documents to an input "text blob". There are words in the input text blob 
which are "uninteresting" for my application, so I create a list of these 
words. These words are "uninteresting" no matter what their tense or 
usage, for example, "develop", "developing", "developed", and "developer" 
are all uninteresting and I do not want them included in the search query 
created by the MoreLikeThis class.

My index documents are stemmed using the Snowball analyzer. I do not use 
any stopwords when the documents are indexed (as I would like the choice 
of stopwords to be under user control at search time).

I would like the user to be able to provide to the search application a 
list of "uninteresting" words, and for obvious reasons would like to force 
them to provide only, say, "developer" and have the application understand 
that all variants should be ignored (and I don't want to force them to try 
to guess what the stemmed version of "developer" is).

My first try was to use MoreLikeThis with the Snowball analyzer and a 
simple list of unstemmed stopwords (MoreLikeThis.setAnalyzer and 
MoreLikeThis.setStopWords). However, it appears that the stopwords 
provided to the MoreLikeThis class are compared in an exact way to the 
token stream output by the Snowball filter (where the words have been 
stemmed), so "developer" will not match anything, and all variants pass 
through. Even if I provide the list of unstemmed stopwords to the snowball 
analyzer instead, they are used "as-is" with no stemming performed, so 
"developer" will not remove "developed". 

Apparently the following is necessary for my application:
Construct a snowball analyzer with no stopwords. Use the unstemmed 
stopword list with the analyzer to construct a stemmed version of the set 
of stopwords. Use this set of stemmed stopwords as the stopwords input to 
the MoreLikeThis class (where the tokens are compared to the stemmed 
versions after been output from the Snowball analyzer).

Is my understanding correct?

Donna

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message