lucene-java-user mailing list archives

From David Spencer <dave-lucene-u...@tropo.com>
Subject Re: MoreLikeThis Query generator - Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??
Date Wed, 18 Feb 2004 19:01:47 GMT
Bruce Ritchie wrote:

> David Spencer wrote:
>
>> [c] "interesting words" - uses code from MoreLikeThis to give a table
>> of all interesting words in the current "source" doc, ordered by score.
>> Remember score is idf*tf as per Doug's mail (and as per my hopefully
>> correct understanding of these things). This page is of course more of
>> a debugging tool than something a normal user would see. One possible
>> area of improvement that jumped out at me after reviewing this table is
>> using stemming, say, allowing more words in the generated query when 2
>> words have the same stem.
>
>
> Actually, the analyzer should do that, shouldn't it? For example, I 
> have stemming analyzers for a variety of languages that both stem and 
> remove stop words - it seems silly to me to duplicate that 
> functionality when it's so easily provided by the analyzer. Given 
> that, I would suggest removing the stop word functionality from this class

Actually I realized this is a tricky and possibly counterintuitive issue.

In theory one might want the MoreLikeThis logic to use a *larger* stop 
word list than the Analyzer uses, even in the case where the Analyzer 
does not use any stop word list.

Reasoning is:
-- maybe you don't want the Analyzer to have any stop words (so a user can
find the classic "to be or not to be" phrase), and the search index
compression won't (in theory?) be affected by frequent stop words anyway
-- the stop words used by MoreLikeThis are a heuristic with 2 points
behind them: the obvious one (stop words are not interesting for
similarity) and the fact that they're there to minimize the expensive
IndexReader.docFreq() calls. So a larger stop word list is fine here: it
cuts the number of docFreq() calls and lets the query generator run faster.
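
To make the second point concrete, here is a minimal sketch of the idea. It is not the actual MoreLikeThis code: the class InterestingTermsSketch, the ScoredTerm holder and the docFreq callback are names made up for this mail, and the callback merely stands in for IndexReader.docFreq(). The idf formula is an assumed log-based one, used only to show the tf*idf ordering.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToIntFunction;

// Hypothetical sketch, not the real MoreLikeThis implementation.
class InterestingTermsSketch {

    /** A term paired with its tf*idf score. */
    static final class ScoredTerm {
        final String term;
        final double score;

        ScoredTerm(String term, double score) {
            this.term = term;
            this.score = score;
        }
    }

    /**
     * Scores the terms of one source doc by tf*idf, highest first.
     * docFreq is a stand-in for the expensive index lookup; terms in
     * mltStopWords are dropped BEFORE that lookup is made.
     */
    static List<ScoredTerm> interestingTerms(Map<String, Integer> termFreqs,
                                             Set<String> mltStopWords,
                                             ToIntFunction<String> docFreq,
                                             int numDocs) {
        List<ScoredTerm> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            String term = e.getKey();
            if (mltStopWords.contains(term)) {
                continue; // cheap set lookup, no docFreq call for this term
            }
            int df = docFreq.applyAsInt(term);
            if (df == 0) {
                continue; // term not in the index, useless in a query
            }
            // Assumed idf formula, for illustration only.
            double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
            result.add(new ScoredTerm(term, e.getValue() * idf));
        }
        result.sort(Comparator.comparingDouble((ScoredTerm s) -> s.score).reversed());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = new HashMap<>();
        tf.put("the", 10);
        tf.put("lucene", 3);
        tf.put("query", 2);

        Set<String> stop = new HashSet<>();
        stop.add("the");

        // Fake document frequencies in place of a real index.
        Map<String, Integer> df = new HashMap<>();
        df.put("lucene", 9);
        df.put("query", 99);

        for (ScoredTerm t : interestingTerms(tf, stop, s -> df.getOrDefault(s, 0), 1000)) {
            System.out.println(t.term + " " + t.score);
        }
    }
}
```

Note the stop set passed in here is independent of whatever the Analyzer used at index time, which is the whole point: it can be bigger, since every word in it saves one lookup.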

As an aside I sometimes use a list of ~500 English stop words from 
"SMART" (sorry, can't easily find the ref, though this might be close: 
http://citeseer.nj.nec.com/context/45797/0 ). I can contribute these if 
wanted.

> as it is not needed and only confuses things.
>
>
> Regards,
>
> Bruce Ritchie
> http://www.jivesoftware.com/




