Subject: Re: Proposal: Statistical Stopword elimination
Date: Tue, 8 Apr 2003 13:33:16 +0200
From: Karsten Konrad
To: Lucene Developers List

Hi Doug,

thank you for the comments.

> Note that, with a MultiSearcher, your implementation computed thresholds
> independently for each index, whereas this computes them globally over
> all indexes, which is probably what you want.

I am not so sure. When you search over indexes in different languages,
using one global threshold is error-prone, as I might get fewer words to
eliminate. E.g., the German "das" identifies itself as a stopword by
occurring in more than 70% of all German texts. In a collection of
indexes covering several languages, this percentage might be much lower.
If I use this lower percentage as a threshold, I might find myself
eliminating content words! But if I use separate per-index thresholds, I
can be sure that "das" is eliminated correctly from the query in the
search over the German index by a threshold of 65%, while it does not
play any role in the other indexes anyway :)

> Note also that this is all done with public APIs and requires no changes
> to the Lucene core.

Don't I need to modify or write my own query parser that creates the
modified queries instead of the default ones? Or is there a way to
programmatically set the classes that the parser uses when building up a
query? That would be nice, for instance when you want to modify the
fuzzy matcher's fuzzy threshold and such things. You could then also
turn parser features (like fuzzy search) on and off when they are too
expensive to use with many concurrent users.

> Please post the code. If folks use it, then it's worthwhile and we
> should probably include it with Lucene. Ideally it should be simple to
> implement such things with the public APIs without having to build
> more features into the core.
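The per-index point above can be made concrete with invented numbers
(the counts below are hypothetical, chosen only to illustrate how
merging indexes dilutes a term's document-frequency ratio):

```java
public class RatioDilution {
    public static void main(String[] args) {
        // Hypothetical counts: "das" appears in 70,000 of 100,000 German
        // documents; the combined multilingual collection adds another
        // 300,000 documents in which "das" barely occurs.
        int germanDocFreq = 70000;
        int germanMaxDoc = 100000;
        int globalMaxDoc = 400000;   // all indexes combined

        float perIndexRatio = (float) germanDocFreq / germanMaxDoc;
        float globalRatio = (float) germanDocFreq / globalMaxDoc;

        // Per-index, "das" clears a 65% stopword threshold; computed
        // globally, the ratio drops below it and "das" would survive
        // as if it were a content word.
        System.out.println(perIndexRatio);  // 0.7
        System.out.println(globalRatio);    // 0.175
    }
}
```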
I have found the following modification to Similarity useful: with it,
you can use the frequency threshold to force term weights of 0. This is
often safer than filtering, and unlike my previous suggestions it does
not require any changes to Lucene (which is, as we all know, an
excellent tool. Is there a Lucene fan club somewhere that I can join?)

/** Expert: Holds the factor that defines the document frequency
 * from which on a term is counted as zero weight. E.g., a cut
 * factor of 0.8 treats all terms that occur in more than 80% of
 * the documents as having zero weight. The default is 1.0, which
 * has no influence on the term weight scores. */
private float limitFactor;

/**
 * Computes a score idf factor for a simple term. The factor is zero
 * if the document frequency of the term is higher than or equal to
 * (maxDoc + 1) * limitFactor.
 *
 * <p>The default implementation is:
 * <pre>
 *   return idf(searcher.docFreq(term), searcher.maxDoc());
 * </pre>
 *
 * <p>Note that {@link Searcher#maxDoc()} is used instead of {@link
 * IndexReader#numDocs()} because it is proportional to {@link
 * Searcher#docFreq(Term)}, i.e., when one is inaccurate, so is the
 * other, and in the same direction.
 *
 * @param term the term in question
 * @param searcher the document collection being searched
 * @return a score factor for the term
 * @throws IOException if the index cannot be accessed.
 */
public float idf(Term term, Searcher searcher) throws IOException {
  int docFreq = searcher.docFreq(term);
  int max = searcher.maxDoc();
  float idf = 0.0f;
  if (docFreq < (max + 1) * getLimitFactor()) {
    idf = idf(docFreq, max);
    // getFrequencies().maximizeFrequency(term.text(), idf);
  }
  return idf;
}

/** Implemented as 1/sqrt(sumOfSquaredWeights).
 * @param sumOfSquaredWeights the sum of the squared weights of the query.
 * @return a query norm.
 */
public float queryNorm(float sumOfSquaredWeights) {
  float weight = 0.0f;
  if (sumOfSquaredWeights > 0.0) {
    weight = (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
  }
  return weight;
}

/** Getter for property limitFactor.
 * @return Value of property limitFactor.
 */
public float getLimitFactor() {
  return limitFactor;
}

/** Setter for property limitFactor.
 * @param limitFactor New value of property limitFactor.
 */
public void setLimitFactor(float limitFactor) {
  this.limitFactor = limitFactor;
}

In the version I am using, the idf() method also collects all terms in
an object-to-int hashtable so that we can more easily highlight them
later on; that line is commented out above.

Regards,

Karsten Konrad

-----Original Message-----
From: Doug Cutting [mailto:cutting@lucene.com]
Sent: Monday, 7 April 2003 20:29
To: Lucene Developers List
Subject: Re: Proposal: Statistical Stopword elimination

Karsten Konrad wrote:
> For this, I have introduced a frequency limit factor into
> Similarity and test for excessively high document frequencies
> in the TermQuery.
>
> My questions:
>
> (1) Is there some more elegant way of doing this?

I think you could do this more simply by creating a subclass of
TermQuery and overriding createWeight, with something like:

  protected Weight createWeight(Searcher searcher) {
    float maxDoc = searcher.maxDoc();
    float ratio = searcher.docFreq(getTerm()) / maxDoc;
    float threshold = ((ThresholdSimilarity)getSimilarity()).getThreshold();
    if (ratio >= threshold)
      return new NullWeight();  // term too frequent: drop it from scoring
    else
      return super.createWeight(searcher);
  }

You'd also need to define ThresholdSimilarity as a subclass of
Similarity or DefaultSimilarity that has a threshold, and define
NullWeight as a Weight implementation whose Scorer does nothing.

Note that, with a MultiSearcher, your implementation computed thresholds
independently for each index, whereas this computes them globally over
all indexes, which is probably what you want.

Note also that this is all done with public APIs and requires no changes
to the Lucene core.

> E.g., access to the docFreq is done again in the TermScorer
> and I would like to remove this redundancy.

I doubt that will substantially impact performance. If it does, it
would be easy to add a small cache to the IndexReader. However, someone
tried this once and found that it didn't make much difference.

> (2) Is this a worthwhile contribution to Lucene's features in your opinion?

Please post the code. If folks use it, then it's worthwhile and we
should probably include it with Lucene. Ideally it should be simple to
implement such things with the public APIs without having to build
more features into the core.
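The frequency-threshold test inside the createWeight sketch above can be
exercised standalone. The class below is invented for illustration; in
the real version the threshold would live on a Similarity subclass and
the two branches would pick NullWeight vs. TermQuery's normal Weight:

```java
// Standalone sketch of the stopword-elimination decision: a term is
// dropped when its document-frequency ratio reaches the threshold.
public class ThresholdCheck {
    private final float threshold;

    public ThresholdCheck(float threshold) {
        this.threshold = threshold;
    }

    /** True when the term is frequent enough to be dropped as a stopword. */
    public boolean eliminate(int docFreq, int maxDoc) {
        float ratio = (float) docFreq / maxDoc;
        return ratio >= threshold;
    }

    public static void main(String[] args) {
        ThresholdCheck check = new ThresholdCheck(0.65f);
        System.out.println(check.eliminate(70, 100));  // true:  0.70 >= 0.65
        System.out.println(check.eliminate(30, 100));  // false: 0.30 <  0.65
    }
}
```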
Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org