lucene-java-user mailing list archives

From Donna L Gresh <gr...@us.ibm.com>
Subject Re: sanity check on how stemming, stopwords, and snowball analyzer works together
Date Mon, 15 Oct 2007 14:55:11 GMT
I wasn't sure that this:

> Instead add the stopwords to the analyzer that 
> you pass to MoreLikeThis. That way you can ensure that the analyzer 
> applies the stopword list before stemming 

would work, because I don't want to have to provide every variant of each 
stopword. If I do this, only the exact form provided will be removed, 
correct?
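
To make the concern concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: `toyStem` is a hypothetical suffix stripper that only stands in for the Snowball stemmer, and the class and method names are made up for illustration. It shows what happens when an exact-match stopword list is applied before stemming.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopBeforeStemDemo {
    // Hypothetical toy stemmer: strips a few English suffixes.
    // It is a stand-in for Snowball, not the real algorithm.
    static String toyStem(String word) {
        for (String suffix : new String[] {"ing", "ed", "er"}) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Drop exact stopword matches first, then stem whatever is left.
    static List<String> stopThenStem(List<String> tokens, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!stopwords.contains(token)) {
                out.add(toyStem(token));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("developer"));
        List<String> tokens =
            Arrays.asList("developer", "developed", "developing", "lucene");
        System.out.println(stopThenStem(tokens, stopwords));
    }
}
```

With only "developer" in the stop list, "developed" and "developing" are not removed; they survive and are stemmed to the same root, so the unwanted term still reaches the query.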


Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
gresh@us.ibm.com


Mark Miller <markrmiller@gmail.com> wrote on 10/15/2007 10:37:22 AM:

> Sounds right to me.
> 
> The other option I think you have is to not use the MoreLikeThis 
> stopword functionality. Instead add the stopwords to the analyzer that 
> you pass to MoreLikeThis. That way you can ensure that the analyzer 
> applies the stopword list before stemming (The MoreLikeThis stopword 
> removal is implemented so that stopwords are removed after stemming). 
> Then you just have to add 'developer' to the stop list, and you can 
> forget about handling stemmed forms.
> 
> Your method should also work though.
> 
> - Mark
> 
> Donna L Gresh wrote:
> > Could those "in the know" comment on my current understanding of
> > stemming and stopwords using the Snowball analyzer?
> >
> > In my application, I am using the MoreLikeThis class to find similar
> > documents to an input "text blob". There are words in the input text
> > blob which are "uninteresting" for my application, so I create a list
> > of these words. These words are "uninteresting" no matter what their
> > tense or usage, for example, "develop", "developing", "developed", and
> > "developer" are all uninteresting and I do not want them included in
> > the search query created by the MoreLikeThis class.
> >
> > My index documents are stemmed using the Snowball analyzer. I do not
> > use any stopwords when the documents are indexed (as I would like the
> > choice of stopwords to be under user control at search time).
> >
> > I would like the user to be able to provide to the search application
> > a list of "uninteresting" words, and for obvious reasons would like to
> > force them to provide only, say, "developer" and have the application
> > understand that all variants should be ignored (and I don't want to
> > force them to try to guess what the stemmed version of "developer" is).
> >
> > My first try was to use MoreLikeThis with the Snowball analyzer and a
> > simple list of unstemmed stopwords (MoreLikeThis.setAnalyzer and
> > MoreLikeThis.setStopWords). However, it appears that the stopwords
> > provided to the MoreLikeThis class are compared in an exact way to the
> > token stream output by the Snowball filter (where the words have been
> > stemmed), so "developer" will not match anything, and all variants
> > pass through. Even if I provide the list of unstemmed stopwords to the
> > Snowball analyzer instead, they are used "as-is" with no stemming
> > performed, so "developer" will not remove "developed".
> >
> > Apparently the following is necessary for my application:
> > Construct a Snowball analyzer with no stopwords. Use the unstemmed
> > stopword list with the analyzer to construct a stemmed version of the
> > set of stopwords. Use this set of stemmed stopwords as the stopwords
> > input to the MoreLikeThis class (where the tokens are compared to the
> > stemmed versions after being output from the Snowball analyzer).
> >
> > Is my understanding correct?
> >
> > Donna
> >
> > 
> 
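
The approach described in the original question (run the user's unstemmed stopword list through the same stemmer, then compare the stemmed stopwords against the stemmed tokens) can be sketched roughly as follows. This is plain Java with no Lucene dependency: `toyStem` is a hypothetical stand-in for the Snowball stemmer, and all names here are assumptions made for illustration. In the setup described above, the stemmed set would be the one handed to MoreLikeThis.setStopWords.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StemmedStopwordsDemo {
    // Hypothetical toy stemmer standing in for Snowball (not the real algorithm).
    static String toyStem(String word) {
        for (String suffix : new String[] {"ing", "ed", "er"}) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Stem the user-supplied stopwords with the same stemmer used for tokens.
    static Set<String> stemAll(Collection<String> words) {
        Set<String> stemmed = new HashSet<>();
        for (String word : words) {
            stemmed.add(toyStem(word));
        }
        return stemmed;
    }

    // Stem each token first, then drop it if its stem is a stemmed stopword.
    static List<String> stemThenStop(List<String> tokens, Set<String> stemmedStopwords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String stem = toyStem(token);
            if (!stemmedStopwords.contains(stem)) {
                out.add(stem);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The user supplies only one surface form.
        Set<String> stemmedStopwords = stemAll(Arrays.asList("developer"));
        List<String> tokens = Arrays.asList(
            "developer", "developed", "developing", "develop", "lucene", "index");
        System.out.println(stemThenStop(tokens, stemmedStopwords));
    }
}
```

Because the stopword and the tokens pass through the same stemmer, the single entry "developer" is enough to remove every variant in this toy example. Whether a real stemmer maps all variants of a word to one root depends on the stemmer and the word.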
