lucene-dev mailing list archives

From Doug Cutting
Subject Re: Proposal: Statistical Stopword elimination
Date Mon, 07 Apr 2003 18:28:36 GMT
Karsten Konrad wrote:
> For this, I have introduced a frequency limit factor into
> Similarity and test for excessively high document frequencies
> in the TermQuery.
> My questions:
> (1) Is there some more elegant way of doing this?

I think you could do this more simply by creating a subclass of 
TermQuery and overriding createWeight, with something like:

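   protected Weight createWeight(Searcher searcher) {
     float maxDoc = searcher.maxDoc();
     // fraction of all documents that contain this term
     float ratio = searcher.docFreq(getTerm()) / maxDoc;
     float threshold = ...;      // limit taken from the ThresholdSimilarity
     if (ratio < threshold)      // term is rare enough to keep
       return super.createWeight(searcher);
     else                        // excessively frequent: treat as a stopword
       return new NullWeight();    // a no-op weight implementation
   }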

You'd also need to define ThresholdSimilarity as a subclass of 
Similarity or DefaultSimilarity that has a threshold, and define 
NullWeight as a Weight implementation whose Scorer does nothing.
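
As a rough illustration of the no-op idea, here is a minimal, self-contained sketch. SimpleWeight and SimpleScorer are hypothetical, cut-down stand-ins for Lucene's Weight and Scorer (the real ones have more methods), so this shows the shape of the thing and is not code that plugs into Lucene as is.

```java
// Hypothetical, simplified stand-ins for Lucene's Scorer and Weight.
interface SimpleScorer {
    boolean next();   // advance to the next matching document, if any
    int doc();        // current document number
    float score();    // score of the current document
}

interface SimpleWeight {
    float sumOfSquaredWeights();  // this clause's contribution to the query norm
    void normalize(float norm);   // receive the query normalization factor
    SimpleScorer scorer();        // iterator over matching documents
}

/** A scorer that never matches any document. */
final class NullScorer implements SimpleScorer {
    public boolean next() { return false; }
    public int doc() { throw new IllegalStateException("no current document"); }
    public float score() { throw new IllegalStateException("no current document"); }
}

/** A weight that contributes nothing to normalization and matches nothing. */
final class NullWeight implements SimpleWeight {
    public float sumOfSquaredWeights() { return 0.0f; }
    public void normalize(float norm) { /* nothing to normalize */ }
    public SimpleScorer scorer() { return new NullScorer(); }
}
```

Returning zero from sumOfSquaredWeights means the dropped term does not influence how the remaining terms are normalized.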

Note that, with a MultiSearcher, your implementation computed thresholds 
independently for each index, whereas this computes them globally over 
all indexes, which is probably what you want.
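
To make the difference concrete, here is a small stand-alone sketch. The class, the method names, and the document counts are made up for illustration, and it assumes the combined searcher reports docFreq and maxDoc as sums over its sub-indexes.

```java
// Hypothetical example; not Lucene API.
final class ThresholdScope {
    /** Fraction of documents that contain the term. */
    static float ratio(int docFreq, int maxDoc) {
        return docFreq / (float) maxDoc;
    }

    /** Ratio over all indexes combined: summed docFreqs over summed maxDocs. */
    static float globalRatio(int[] docFreqs, int[] maxDocs) {
        int df = 0;
        int max = 0;
        for (int i = 0; i < docFreqs.length; i++) {
            df += docFreqs[i];
            max += maxDocs[i];
        }
        return ratio(df, max);
    }
}
```

Suppose index A holds 100 documents with the term in 90 of them, and index B holds 900 documents with the term in 10. Per index the ratios are 0.9 and about 0.011, so a threshold of 0.5 would drop the term in A but keep it in B. Globally the ratio is 100/1000 = 0.1, so the term is kept consistently across both.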

Note also that this is all done with public APIs and requires no changes 
to the Lucene core.

 > E.g., access to the docFreq is done again in the TermScorer
 > and I would like to remove this redundancy.

I doubt that will substantially impact performance.  If it does, it 
would be easy to add a small cache into the IndexReader.  However 
someone tried this once and found that it didn't make much difference.
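
If it ever did matter, the cache could be as simple as the following sketch. DocFreqCache is a hypothetical name, the lookup function stands in for whatever actually reads the document frequency, and it is written against a current JDK rather than anything in Lucene.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical sketch: remembers docFreq results by term text.
final class DocFreqCache {
    private final Map<String, Integer> cache = new HashMap<>();
    private final ToIntFunction<String> lookup; // the underlying, slower lookup

    DocFreqCache(ToIntFunction<String> lookup) {
        this.lookup = lookup;
    }

    /** Returns the cached value, calling the underlying lookup only on a miss. */
    int docFreq(String term) {
        Integer cached = cache.get(term);
        if (cached == null) {
            cached = lookup.applyAsInt(term);
            cache.put(term, cached);
        }
        return cached;
    }
}
```

This version is not thread-safe and never evicts or invalidates entries, so it only makes sense for an index that does not change while the cache is alive.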

> (2) Is this a worthwhile contribution to Lucene's features in your opinion?

Please post the code.  If folks use it, then it's worthwhile and we 
should probably include it with Lucene.  Ideally it should be simple to 
implement such things with the public APIs without having to build 
more features into the core.

