lucene-java-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: sanity check on how stemming, stopwords, and snowball analyzer works together
Date Mon, 15 Oct 2007 14:37:22 GMT
Sounds right to me.

The other option, I think, is to skip the MoreLikeThis stopword 
functionality altogether and instead add the stopwords to the analyzer 
that you pass to MoreLikeThis. That way you can ensure that the analyzer 
applies the stopword list before stemming (the MoreLikeThis stopword 
removal is implemented so that stopwords are removed after stemming). 
Then you just have to add 'developer' to the stop list, and you can 
forget about handling stemmed forms.
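
To make the ordering concrete, here is a minimal, self-contained sketch of a
"stop filter first, stemmer second" pipeline. It deliberately does not use
Lucene: toyStem is a made-up suffix stripper standing in for Snowball, and
analyze is a hypothetical helper, not a Lucene API. In Lucene itself this
would roughly correspond to an analyzer that chains a StopFilter ahead of a
SnowballFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopBeforeStemSketch {

    // Toy stand-in for a real stemmer (NOT the Snowball algorithm):
    // strips one common English suffix if at least 3 characters remain.
    static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "ed", "er", "s"}) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Lowercase, split on whitespace, drop stopwords, then stem what is left.
    // The stop list is compared against the UNSTEMMED surface form.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String raw : text.toLowerCase().trim().split("\\s+")) {
            if (raw.isEmpty() || stopWords.contains(raw)) {
                continue;
            }
            out.add(toyStem(raw));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("developer"));
        // "developer" is dropped before stemming; "developed" is not in the
        // stop list, so it survives and is stemmed.
        System.out.println(analyze("developer developed index", stop));
        // prints [develop, index]
    }
}
```

With this ordering the stop list is matched against the exact surface forms,
so in the toy pipeline only the listed word itself is removed; other variants
still pass through unless they are listed too.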

Your method should also work though.
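
A rough sketch of that method, again without Lucene so that it runs on its
own: the user's unstemmed words are pushed through the same stemmer as the
text, and the resulting stemmed set is what gets compared against the stemmed
tokens. toyStem is a made-up suffix stripper standing in for Snowball, and
stemStopWords and analyze are hypothetical helpers; in the real setup the
stemmed set would be what you hand to MoreLikeThis.setStopWords.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StemmedStopSetSketch {

    // Toy stand-in for a real stemmer (NOT the Snowball algorithm):
    // strips one common English suffix if at least 3 characters remain.
    static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "ed", "er", "s"}) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Run the user-supplied (unstemmed) words through the same stemmer
    // that is applied to the text, producing a set of stemmed stopwords.
    static Set<String> stemStopWords(Collection<String> userWords) {
        Set<String> stemmed = new HashSet<>();
        for (String word : userWords) {
            stemmed.add(toyStem(word.toLowerCase()));
        }
        return stemmed;
    }

    // Stem every token first, then drop tokens whose stem is in the set.
    static List<String> analyze(String text, Set<String> stemmedStops) {
        List<String> out = new ArrayList<>();
        for (String raw : text.toLowerCase().trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String stem = toyStem(raw);
            if (!stemmedStops.contains(stem)) {
                out.add(stem);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = stemStopWords(Arrays.asList("developer"));
        // All four variants share the stem of "developer", so all are removed.
        System.out.println(
            analyze("develop developing developed developer index", stops));
        // prints [index]
    }
}
```

Here the user only has to supply "developer"; every variant that stems to the
same form is filtered, at least as far as the stemmer maps them to one stem.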

- Mark

Donna L Gresh wrote:
> Could those "in the know" comment on my current understanding of stemming 
> and stopwords using the snowball analyzer?
>
> In my application, I am using the MoreLikeThis class to find similar 
> documents to an input "text blob". There are words in the input text blob 
> which are "uninteresting" for my application, so I create a list of these 
> words. These words are "uninteresting" no matter what their tense or 
> usage, for example, "develop", "developing", "developed", and "developer" 
> are all uninteresting and I do not want them included in the search query 
> created by the MoreLikeThis class.
>
> My index documents are stemmed using the Snowball analyzer. I do not use 
> any stopwords when the documents are indexed (as I would like the choice 
> of stopwords to be under user control at search time).
>
> I would like the user to be able to provide to the search application a 
> list of "uninteresting" words, and for obvious reasons would like to force 
> them to provide only, say, "developer" and have the application understand 
> that all variants should be ignored (and I don't want to force them to try 
> to guess what the stemmed version of "developer" is).
>
> My first try was to use MoreLikeThis with the Snowball analyzer and a 
> simple list of unstemmed stopwords (MoreLikeThis.setAnalyzer and 
> MoreLikeThis.setStopWords). However, it appears that the stopwords 
> provided to the MoreLikeThis class are compared in an exact way to the 
> token stream output by the Snowball filter (where the words have been 
> stemmed), so "developer" will not match anything, and all variants pass 
> through. Even if I provide the list of unstemmed stopwords to the snowball 
> analyzer instead, they are used "as-is" with no stemming performed, so 
> "developer" will not remove "developed". 
>
> Apparently the following is necessary for my application:
> Construct a snowball analyzer with no stopwords. Use the unstemmed 
> stopword list with the analyzer to construct a stemmed version of the set 
> of stopwords. Use this set of stemmed stopwords as the stopwords input to 
> the MoreLikeThis class (where the tokens are compared to the stemmed 
> versions after being output from the Snowball analyzer).
>
> Is my understanding correct?
>
> Donna
>
>   
