Message-ID: <3DDC1541.2050703@lucene.com>
Date: Wed, 20 Nov 2002 15:05:37 -0800
From: Doug Cutting
To: Lucene Developers List
Subject: Re: New PhrasePrefixQuery.java

Konrad Scherer wrote:
>> I think it would be good to get this functionality into the Query
>> parser. There is currently a gap between what is trivially available
>> in the query parser (strings with wildcard characters) and the
>> PhrasePrefixQuery API (an array of terms). What it seems to me is
>> needed is just a utility method somewhere that expands a wildcarded
>> string into an array of terms. This is probably best done in
>> PhrasePrefixQuery.scorer, when an IndexReader is available. So the
>> approach I would suggest is extending the API of PhrasePrefixQuery
>> with a method like:
>>   PhrasePrefixQuery.addTermPrefix(Term term);
>> or
>>   PhrasePrefixQuery.addWildcardTerm(Term term);
>> where the term.text() contains either a term prefix or a wildcard
>> pattern. Then, in the scorer() implementation this can be expanded.
>> PhrasePrefixQuery would then need to do some bookkeeping to identify
>> which terms need expansion.
>>
>> Does this make sense?
>
> Yes it makes sense, but there is a problem. To expand a wildcard, an
> IndexReader is necessary. I chose the prepare method because then the
> wildcard term can be expanded before the function sumOfSquaredWeights
> is called.

Good point. Keep in mind that, with MultiSearcher, more than one
IndexReader may be involved. The correct thing to do is to take the
union of the wildcard expansions across all readers. More on this below.

> I must admit to not understanding the weighting system at all; I
> haven't taken the time to think about it yet.

The value of sumOfSquaredWeights only alters the absolute value of
scores, not relative ranking. It is part of the code that attempts to
normalize scores based on the query, so that scores for different
queries are somewhat comparable. However, absolute values of Lucene
scores are not very meaningful anyway. So it might be acceptable to take
shortcuts with the value returned and just use a constant value for
wildcarded terms.

Unfortunately, sumOfSquaredWeights also has a side-effect of computing
the idf weight for the phrase, which does affect ranking. So the correct
solution is more complex.

The only way I can see to fix this correctly would be to change the
Searchable, Query and Scorer APIs as follows:

1. Add a term expansion or term iteration method to Searchable, so that
prefix and wildcard expansion can be done across all IndexReaders in a
MultiSearcher before term weighting.

2. Change sumOfSquaredWeights implementations not to alter the query,
but rather to just compute the returned value, using Searchable methods.

3. Move the normalize() method from Query to Scorer and eliminate the
Query.prepare() method.

4. Change scorer implementations to compute idfs using Searchable
methods.

It's a shame to compute the idfs both in the query's
sumOfSquaredWeights method and again in the scorer. Perhaps Searchable
implementations could cache docFreq() values so that this is not
expensive.

As you can see, these are rather involved changes, not to be done
lightly, but I think they would also fix some longstanding bugs.

In the short term, the simple approach might be to only operate
correctly when an IndexSearcher is used, and not when a MultiSearcher is
used. Sigh.

Longer term, I will add revising these APIs to my queue of tasks.

Doug
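
For illustration only, here is a rough sketch of the "union of the wildcard expansions across all readers" idea from the message above. It deliberately does not use any Lucene classes: TermSource is a hypothetical stand-in for one index's term dictionary, and expandPrefix is an assumed helper name, not an existing or proposed Lucene method. It handles only prefix matching, not general wildcard patterns.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixUnionSketch {

    /** Hypothetical stand-in for one index's term dictionary; not a Lucene type. */
    public interface TermSource {
        List<String> terms();
    }

    /**
     * Collects every term that starts with the given prefix, merged across
     * all sources. A term present in several sources appears once.
     */
    public static SortedSet<String> expandPrefix(String prefix, List<TermSource> sources) {
        SortedSet<String> union = new TreeSet<>();
        for (TermSource source : sources) {
            for (String term : source.terms()) {
                if (term.startsWith(prefix)) {
                    union.add(term);
                }
            }
        }
        return union;
    }

    public static void main(String[] args) {
        // Two made-up term dictionaries, as if from two separate indexes.
        TermSource first = () -> Arrays.asList("apple", "apply", "banana");
        TermSource second = () -> Arrays.asList("applet", "apply", "cherry");

        // Prints [apple, applet, apply]
        System.out.println(expandPrefix("app", Arrays.asList(first, second)));
    }
}
```

The point of doing the union before any weighting is that every index then scores the same expanded term set, which is what a term expansion method on Searchable (item 1 above) would make possible.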