lucene-dev mailing list archives

From "Karsten Konrad" <Karsten.Kon...@xtramind.com>
Subject Re: Proposal: Statistical Stopword elimination
Date Tue, 08 Apr 2003 11:33:16 GMT

Hi Doug,

thank you for the comments. 

>>
Note that, with a MultiSearcher, your implementation computed thresholds 
independently for each index, whereas this computes them globally over 
all indexes, which is probably what you want.
>>

I am not so sure. When you search over indexes in different languages,
using one global threshold is error-prone, as I might get fewer words to
eliminate. E.g., the German "das" identifies itself as a stopword by occurring
in more than 70% of all German texts. In a collection of indexes covering
several languages, this percentage might be much lower. If I use this lower
percentage as a threshold, I might find myself eliminating content words!
But if I use separate per-index thresholds, I can be sure that "das" is
eliminated correctly from the query in the search over the German index by a
threshold of 65%, while it does not play any role in the other indexes anyway :)
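
To make the point concrete, here is a minimal sketch, assuming one keeps
references to the sub-searchers behind the MultiSearcher (the helper class
and its name are my invention, not Lucene API):

  import java.io.IOException;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Searcher;

  public class PerIndexStopwords {

    /** Decides per index whether a term exceeds the stopword threshold;
     *  searchers[i] is the i-th index behind the MultiSearcher. */
    public static boolean[] isStopword(Term term, Searcher[] searchers,
                                       float threshold) throws IOException {
      boolean[] stop = new boolean[searchers.length];
      for (int i = 0; i < searchers.length; i++) {
        // Fraction of documents in THIS index that contain the term.
        float ratio = searchers[i].docFreq(term)
                    / (float) searchers[i].maxDoc();
        // "das" would trip this in the German index only.
        stop[i] = ratio >= threshold;
      }
      return stop;
    }
  }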

>>
Note also that this is all done with public APIs and requires no changes 
to the Lucene core.
>>

Don't I need to modify or write my own parser that creates the
modified queries instead of the default ones? Or is there a way to
programmatically set the classes that the parser uses when building
up a query? That would be nice, for instance when you want to modify
the fuzzy matcher's fuzzy threshold and similar things...

Also, you could then turn parser features (like fuzzy search) on and off
when they are too expensive to use with many concurrent users. One way to
sidestep the parser question entirely is sketched below.
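
For what it is worth, here is that sidestep as a hedged sketch: parse with
the stock QueryParser, then walk the resulting tree and swap every TermQuery
for the threshold-aware subclass. ThresholdTermQuery stands for the TermQuery
subclass from your createWeight sketch (quoted below); the rewriting helper
itself is my invention:

  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;

  public class QueryRewriter {

    /** Replaces every TermQuery in the tree by a ThresholdTermQuery;
     *  other query types are passed through unchanged. */
    public static Query rewrite(Query q) {
      if (q instanceof TermQuery) {
        Query replaced = new ThresholdTermQuery(((TermQuery) q).getTerm());
        replaced.setBoost(q.getBoost());
        return replaced;
      }
      if (q instanceof BooleanQuery) {
        BooleanQuery rebuilt = new BooleanQuery();
        BooleanClause[] clauses = ((BooleanQuery) q).getClauses();
        for (int i = 0; i < clauses.length; i++) {
          rebuilt.add(rewrite(clauses[i].query),
                      clauses[i].required, clauses[i].prohibited);
        }
        rebuilt.setBoost(q.getBoost());
        return rebuilt;
      }
      return q;  // phrase, wildcard, fuzzy etc. are left as they are
    }
  }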

>>
Please post the code.  If folks use it, then it's worthwhile and we 
should probably include it with Lucene.  Ideally it should be simple to 
implement such things with the public APIs without having to build 
more features into the core.
>>

I have found the following modification to Similarity useful: it uses the
frequency threshold to force term weights of 0. This is often safer than
filtering and, unlike my previous suggestions, does not require any changes
to Lucene (which is, as we all know, an excellent tool. Is there a Lucene
fan club somewhere that I can join?)

>>
  /** Expert: Holds the factor that defines the document frequency
   * at which a term is counted as having zero weight. E.g., a
   * cut factor of 0.8 treats all terms that occur in more than
   * 80% of the documents as having zero weight. The default of
   * 1.0 leaves the term weight scores unchanged.
   */
  private float limitFactor = 1.0f;

  /**
   * Computes a score idf factor for a simple term. The factor is
   * zero if the document frequency of the term is greater than or
   * equal to ((maxDoc+1)*limitFactor).
   *
   * <p>The default implementation is:<pre>
   *   return idf(searcher.docFreq(term), searcher.maxDoc());
   * </pre>
   *
   * Note that {@link Searcher#maxDoc()} is used instead of {@link
   * IndexReader#numDocs()} because it is proportional to {@link
   * Searcher#docFreq(Term)} , i.e., when one is inaccurate, so is the other,
   * and in the same direction.
   * @return a score factor for the term
   * @param term the term in question
   * @param searcher the document collection being searched
   * @throws IOException thrown when access to the index fails.
   */
  public float idf(Term term, Searcher searcher) throws IOException {
    int docFreq = searcher.docFreq(term);
    int max = searcher.maxDoc();
    float idf = 0.0f;
    if (docFreq < (max+1)*getLimitFactor()) {
      idf = idf(docFreq, max);
      // getFrequencies().maximizeFrequency(term.text(), idf);
    }
    
    return idf;
  }

  /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>, returning
   * 0 when the sum itself is 0.
   * @param sumOfSquaredWeights the sum of the squared weights of the query.
   * @return a query norm.
   */
  public float queryNorm(float sumOfSquaredWeights) {
    float weight = 0.0f;
    if (sumOfSquaredWeights > 0.0) {
      weight = (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
    }
    return weight;
  }

  /** Getter for property limitFactor.
   * @return Value of property limitFactor.
   */
  public float getLimitFactor() {
    return limitFactor;
  }

  /** Setter for property limitFactor.
   * @param limitFactor New value of property limitFactor.
   */
  public void setLimitFactor(float limitFactor) {
    this.limitFactor = limitFactor;
  }
>>

In the version I am using, the idf() method also collects all terms in an
Object-to-int hashtable so that we can highlight them more easily later on;
that is the line shown commented out above.
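
For completeness, a hedged usage sketch (it assumes the Searcher exposes
setSimilarity(), and LimitSimilarity is just my name for a Similarity
subclass carrying the code above):

  import org.apache.lucene.search.IndexSearcher;

  // Treat terms occurring in more than 70% of all documents as weightless.
  LimitSimilarity similarity = new LimitSimilarity();
  similarity.setLimitFactor(0.7f);

  IndexSearcher searcher = new IndexSearcher("/path/to/index");
  searcher.setSimilarity(similarity);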

Regards, 

Karsten Konrad

-----Original Message-----
From: Doug Cutting [mailto:cutting@lucene.com]
Sent: Monday, April 7, 2003 20:29
To: Lucene Developers List
Subject: Re: Proposal: Statistical Stopword elimination


Karsten Konrad wrote:
> For this, I have introduced a frequency limit factor into
> Similarity and test for excessively high document frequencies
> in the TermQuery.
>
> My questions:
> 
> (1) Is there some more elegant way of doing this?

I think you could do this more simply by creating a subclass of 
TermQuery and overriding createWeight, with something like:

   protected Weight createWeight(Searcher searcher) {
     float maxDoc = searcher.maxDoc();
     float ratio = searcher.docFreq(getTerm()) / maxDoc;
     float threshold =
        ((ThresholdSimilarity) getSimilarity()).getThreshold();
     if (ratio < threshold)
       return super.createWeight(searcher);
     else
       return new NullWeight();    // a no-op weight implementation
   }

You'd also need to define ThresholdSimilarity as a subclass of 
Similarity or DefaultSimilarity that has a threshold, and define 
NullWeight as a Weight implementation whose Scorer does nothing.
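
For concreteness, minimal sketches of those two classes (hedged: the exact
Weight method set is assumed from the public interface of this era, and the
Explanation text is invented):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.DefaultSimilarity;
  import org.apache.lucene.search.Explanation;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.Scorer;
  import org.apache.lucene.search.Weight;

  /** DefaultSimilarity plus a configurable document-frequency threshold. */
  class ThresholdSimilarity extends DefaultSimilarity {
    private float threshold = 1.0f;  // 1.0 disables stopword elimination
    public float getThreshold() { return threshold; }
    public void setThreshold(float threshold) { this.threshold = threshold; }
  }

  /** A Weight that matches nothing: its Scorer is null and its value 0. */
  class NullWeight implements Weight {
    private final Query query;
    NullWeight(Query query) { this.query = query; }
    public Query getQuery() { return query; }
    public float getValue() { return 0.0f; }
    public float sumOfSquaredWeights() { return 0.0f; }
    public void normalize(float norm) {}
    public Scorer scorer(IndexReader reader) { return null; }  // no matches
    public Explanation explain(IndexReader reader, int doc) {
      return new Explanation(0.0f, "term eliminated by frequency threshold");
    }
  }

With this constructor, the new NullWeight() call in the snippet above would
read new NullWeight(this).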

Note that, with a MultiSearcher, your implementation computed thresholds 
independently for each index, whereas this computes them globally over 
all indexes, which is probably what you want.

Note also that this is all done with public APIs and requires no changes 
to the Lucene core.

> E.g., access to the docFreq is done again in the TermScorer
> and I would like to remove this redundancy.

I doubt that will substantially impact performance.  If it does, it 
would be easy to add a small cache into the IndexReader.  However, 
someone tried this once and found that it didn't make much difference.
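
For illustration, a hedged sketch of such a cache one level up, at the
Searcher (the wrapper class is hypothetical, not a Lucene API, and a real
version would need invalidating when the index changes):

  import java.io.IOException;
  import java.util.Hashtable;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Searcher;

  /** Memoizes docFreq lookups so repeated calls for a term hit the cache. */
  class DocFreqCache {
    private final Searcher searcher;
    private final Hashtable cache = new Hashtable();  // Term -> Integer

    DocFreqCache(Searcher searcher) { this.searcher = searcher; }

    int docFreq(Term term) throws IOException {
      Integer cached = (Integer) cache.get(term);
      if (cached == null) {
        cached = new Integer(searcher.docFreq(term));
        cache.put(term, cached);
      }
      return cached.intValue();
    }
  }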

> (2) Is this a worthwhile contribution to Lucene's features in your opinion?

Please post the code.  If folks use it, then it's worthwhile and we 
should probably include it with Lucene.  Ideally it should be simple to 
implement such things with the public APIs without having to build 
more features into the core.

Doug


