Subject: Re: Proposal: Statistical Stopword elimination
Date: Tue, 8 Apr 2003 13:33:16 +0200
From: Karsten Konrad
To: Lucene Developers List

Hi Doug,

thank you for the comments.

> Note that, with a MultiSearcher, your implementation computed thresholds
> independently for each index, whereas this computes them globally over
> all indexes, which is probably what you want.

I am not so sure. When you search over indexes in different languages,
using one global threshold is error-prone, as I might get fewer words to
eliminate. E.g., the German "das" identifies itself as a stopword by
occurring in more than 70% of all German texts. In a collection of
indexes covering several languages, this percentage might be much lower.
If I use this lower percentage as a threshold, I might find myself
eliminating content words! But if I use separate per-index thresholds, I
can be sure that "das" is eliminated correctly from the query in the
search over the German index by a threshold of 65%, while it does not
play any role in the other indexes anyway :)

> Note also that this is all done with public APIs and requires no changes
> to the Lucene core.

Don't I need to modify or write my own query parser that creates the
modified queries instead of the default ones? Or is there a way to
programmatically set the classes that the parser uses when building up a
query? That would be nice, for instance when you want to modify the
fuzzy matcher's fuzzy threshold and such things. You could then also
turn parser features (like fuzzy search) on and off when they are too
expensive to use with many concurrent users.

> Please post the code. If folks use it, then it's worthwhile and we
> should probably include it with Lucene. Ideally it should be simple to
> implement such things with the public APIs without having to build
> more features into the core.
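The per-index point above can be made concrete with invented numbers
(the counts below are hypothetical, chosen only to illustrate how
merging indexes dilutes a term's document-frequency ratio):

```java
public class RatioDilution {
    public static void main(String[] args) {
        // Hypothetical counts: "das" appears in 70,000 of 100,000 German
        // documents; the combined multilingual collection adds another
        // 300,000 documents in which "das" barely occurs.
        int germanDocFreq = 70000;
        int germanMaxDoc = 100000;
        int globalMaxDoc = 400000;   // all indexes combined

        float perIndexRatio = (float) germanDocFreq / germanMaxDoc;
        float globalRatio = (float) germanDocFreq / globalMaxDoc;

        // Per-index, "das" clears a 65% stopword threshold; computed
        // globally, the ratio drops below it and "das" would survive
        // as if it were a content word.
        System.out.println(perIndexRatio);  // 0.7
        System.out.println(globalRatio);    // 0.175
    }
}
```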
I have found the following modification to Similarity useful: with it,
you can use the frequency threshold to force term weights of 0. This is
often safer than filtering, and unlike my previous suggestions it does
not require any changes to Lucene (which is, as we all know, an
excellent tool. Is there a Lucene fan club somewhere that I can join?)

/** Expert: Holds the factor that defines the document frequency
 * from which on a term is counted as zero weight. E.g., a cut
 * factor of 0.8 treats all terms that occur in more than 80% of
 * the documents as having zero weight. The default is 1.0, which
 * has no influence on the term weight scores. */
private float limitFactor;

/**
 * Computes a score idf factor for a simple term. The factor is zero
 * if the document frequency of the term is higher than or equal to
 * (maxDoc + 1) * limitFactor.
 *
 * <p>The default implementation is:
 * <pre>
 *   return idf(searcher.docFreq(term), searcher.maxDoc());
 * </pre>
 *
 * <p>Note that {@link Searcher#maxDoc()} is used instead of {@link
 * IndexReader#numDocs()} because it is proportional to {@link
 * Searcher#docFreq(Term)}, i.e., when one is inaccurate, so is the
 * other, and in the same direction.
 *
 * @param term the term in question
 * @param searcher the document collection being searched
 * @return a score factor for the term
 * @throws IOException if the index cannot be accessed.
 */
public float idf(Term term, Searcher searcher) throws IOException {
  int docFreq = searcher.docFreq(term);
  int max = searcher.maxDoc();
  float idf = 0.0f;
  if (docFreq < (max + 1) * getLimitFactor()) {
    idf = idf(docFreq, max);
    // getFrequencies().maximizeFrequency(term.text(), idf);
  }
  return idf;
}

/** Implemented as 1/sqrt(sumOfSquaredWeights).
 * @param sumOfSquaredWeights the sum of the squared weights of the query.
 * @return a query norm.
 */
public float queryNorm(float sumOfSquaredWeights) {
  float weight = 0.0f;
  if (sumOfSquaredWeights > 0.0) {
    weight = (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
  }
  return weight;
}

/** Getter for property limitFactor.
 * @return Value of property limitFactor.
 */
public float getLimitFactor() {
  return limitFactor;
}

/** Setter for property limitFactor.
 * @param limitFactor New value of property limitFactor.
 */
public void setLimitFactor(float limitFactor) {
  this.limitFactor = limitFactor;
}

In the version I am using, the idf() method also collects all terms in
an object-to-int hashtable so that we can more easily highlight them
later on; that line is commented out above.

Regards,

Karsten Konrad

-----Original Message-----
From: Doug Cutting [mailto:cutting@lucene.com]
Sent: Monday, 7 April 2003 20:29
To: Lucene Developers List
Subject: Re: Proposal: Statistical Stopword elimination

Karsten Konrad wrote:
> For this, I have introduced a frequency limit factor into
> Similarity and test for excessively high document frequencies
> in the TermQuery.
>
> My questions:
>
> (1) Is there some more elegant way of doing this?

I think you could do this more simply by creating a subclass of
TermQuery and overriding createWeight, with something like:

  protected Weight createWeight(Searcher searcher) {
    float maxDoc = searcher.maxDoc();
    float ratio = searcher.docFreq(getTerm()) / maxDoc;
    float threshold = ((ThresholdSimilarity)getSimilarity()).getThreshold();
    if (ratio >= threshold)
      return new NullWeight();  // term too frequent: drop it from scoring
    else
      return super.createWeight(searcher);
  }

You'd also need to define ThresholdSimilarity as a subclass of
Similarity or DefaultSimilarity that has a threshold, and define
NullWeight as a Weight implementation whose Scorer does nothing.

Note that, with a MultiSearcher, your implementation computed thresholds
independently for each index, whereas this computes them globally over
all indexes, which is probably what you want.

Note also that this is all done with public APIs and requires no changes
to the Lucene core.

> E.g., access to the docFreq is done again in the TermScorer
> and I would like to remove this redundancy.

I doubt that will substantially impact performance. If it does, it
would be easy to add a small cache to the IndexReader. However, someone
tried this once and found that it didn't make much difference.

> (2) Is this a worthwhile contribution to Lucene's features in your opinion?

Please post the code. If folks use it, then it's worthwhile and we
should probably include it with Lucene. Ideally it should be simple to
implement such things with the public APIs without having to build
more features into the core.
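The frequency-threshold test inside the createWeight sketch above can be
exercised standalone. The class below is invented for illustration; in
the real version the threshold would live on a Similarity subclass and
the two branches would pick NullWeight vs. TermQuery's normal Weight:

```java
// Standalone sketch of the stopword-elimination decision: a term is
// dropped when its document-frequency ratio reaches the threshold.
public class ThresholdCheck {
    private final float threshold;

    public ThresholdCheck(float threshold) {
        this.threshold = threshold;
    }

    /** True when the term is frequent enough to be dropped as a stopword. */
    public boolean eliminate(int docFreq, int maxDoc) {
        float ratio = (float) docFreq / maxDoc;
        return ratio >= threshold;
    }

    public static void main(String[] args) {
        ThresholdCheck check = new ThresholdCheck(0.65f);
        System.out.println(check.eliminate(70, 100));  // true:  0.70 >= 0.65
        System.out.println(check.eliminate(30, 100));  // false: 0.30 <  0.65
    }
}
```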
Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org