lucene-dev mailing list archives

From Doug Cutting <cutt...@lucene.com>
Subject Re: New PhrasePrefixQuery.java
Date Wed, 20 Nov 2002 23:05:37 GMT
Konrad Scherer wrote:
>> I think it would be good to get this functionality into the Query 
>> parser.  There is currently a gap between what is trivially available 
>> in the query parser (strings with wildcard characters) and the 
>> PhrasePrefixQuery API (an array of terms).  What seems to be needed
>> is just a utility method somewhere that expands a wildcarded 
>> string into an array of terms.  This is probably best done in 
>> PhrasePrefixQuery.scorer, when an IndexReader is available.  So the 
>> approach I would suggest is extending the API of PhrasePrefixQuery 
>> with a method like:
>>   PhrasePrefixQuery.addTermPrefix(Term term);
>> or
>>   PhrasePrefixQuery.addWildcardTerm(Term term);
>> where the term.text() contains either a term prefix or a wildcard 
>> pattern.  Then, in the scorer() implementation this can be expanded. 
>> PhrasePrefixQuery would then need to do some bookkeeping to identify 
>> which terms need expansion.
>>
>> Does this make sense?
> 
> Yes it makes sense, but there is a problem. To expand a wildcard, an 
> IndexReader is necessary. I chose the prepare method because then the 
> wildcard term can be expanded before the function sumOfSquaredWeights is 
> called.

Good point.  Keep in mind that with MultiSearcher, more than one IndexReader 
may be involved.  The correct thing to do is to take the union of the 
wildcard expansions across all readers.  More on this below.
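
A minimal sketch of the union step, under stated assumptions: TermDictionary
is a hypothetical stand-in for the sorted term list of one index (it is not a
Lucene class), and PrefixExpansion.expandAcrossAll is an illustrative helper
name, not an existing or proposed API.  It only shows prefix expansion;
wildcard patterns would need a real matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical stand-in for one index: the sorted term texts of a single field. */
final class TermDictionary {
    private final SortedSet<String> terms = new TreeSet<>();

    TermDictionary(String... termTexts) {
        for (String t : termTexts) {
            terms.add(t);
        }
    }

    /** Returns every term that starts with the given prefix, in sorted order. */
    List<String> expandPrefix(String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet starts at the first term >= prefix; matching terms are
        // contiguous in sorted order, so stop at the first non-match.
        for (String t : terms.tailSet(prefix)) {
            if (!t.startsWith(prefix)) {
                break;
            }
            matches.add(t);
        }
        return matches;
    }
}

final class PrefixExpansion {
    private PrefixExpansion() {}

    /** Union of the prefix expansions over all dictionaries, de-duplicated and sorted. */
    static SortedSet<String> expandAcrossAll(String prefix, List<TermDictionary> dictionaries) {
        SortedSet<String> union = new TreeSet<>();
        for (TermDictionary d : dictionaries) {
            union.addAll(d.expandPrefix(prefix));
        }
        return union;
    }
}
```

The point of the union is that every sub-index then sees the same expanded
term list, even for terms that occur in only one of them.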

> I must admit to not understanding the weighting system at all -- I 
> haven't taken the time to think about it yet.

The value of sumOfSquaredWeights only alters the absolute value of 
scores, not relative ranking.  It is part of code that attempts to 
normalize scores based on the query, so that scores for different 
queries are somewhat comparable.  However, absolute values of Lucene 
scores are not very meaningful anyway.  So it might be acceptable to 
take shortcuts with the value returned and just use a constant value for 
wildcarded terms.  Unfortunately, sumOfSquaredWeights also has a 
side-effect of computing the idf weight for the phrase, which does 
affect ranking.  So the correct solution is more complex.
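
To illustrate only the first claim (the normalization value rescales scores
without reordering them), here is a small sketch.  It assumes the
normalization factor is 1/sqrt(sumOfSquaredWeights) and that it multiplies
every hit's raw score uniformly; the class and method names are hypothetical.
It deliberately ignores the idf side effect described above, which is the
part that does change ranking.

```java
import java.util.Arrays;
import java.util.Comparator;

final class ScoreNormalization {
    private ScoreNormalization() {}

    /** Assumed normalization factor: 1 / sqrt(sumOfSquaredWeights). */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    /** Returns document indices ordered by descending normalized score. */
    static Integer[] rank(float[] rawScores, float sumOfSquaredWeights) {
        float norm = queryNorm(sumOfSquaredWeights);
        Integer[] order = new Integer[rawScores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Every score is multiplied by the same positive constant,
        // so the resulting order does not depend on that constant.
        Arrays.sort(order,
            Comparator.comparingDouble((Integer i) -> -(double) (rawScores[i] * norm)));
        return order;
    }
}
```

Calling rank with the same raw scores and two different values of
sumOfSquaredWeights yields the same ordering; only the absolute scores differ.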

The only way I can see to fix this correctly would be to change the 
Searchable, Query and Scorer APIs as follows:

1. Add a term expansion or term iteration method to Searchable, so that 
prefix and wildcard expansion can be done across all IndexReaders in a 
MultiSearcher before term weighting.

2. Change sumOfSquaredWeights implementations not to alter the query, 
but rather to just compute the returned value, using Searchable methods.

3. Move the normalize() method from Query to Scorer and eliminate the 
Query.prepare() method.

4. Change scorer implementations to compute idfs using Searchable methods.

It's a shame to compute the IDFs both in the query's sumOfSquaredWeights 
methods and again in the scorer.  Perhaps Searchable implementations 
could cache docFreq() values so that this is not expensive.
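
A rough sketch of what such caching might look like, assuming a simple
memoizing wrapper.  DocFreqSource is a hypothetical interface standing in for
the docFreq() part of Searchable, and it takes a field name and term text as
plain strings rather than a Term; none of these names are part of Lucene.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the part of Searchable that reports document frequencies. */
interface DocFreqSource {
    int docFreq(String field, String text);
}

/** Memoizes docFreq lookups so repeated requests for the same term hit the delegate once. */
final class CachingDocFreqSource implements DocFreqSource {
    private final DocFreqSource delegate;
    private final Map<String, Integer> cache = new HashMap<>();

    CachingDocFreqSource(DocFreqSource delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized int docFreq(String field, String text) {
        // NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
        String key = field + '\0' + text;
        Integer cached = cache.get(key);
        if (cached == null) {
            cached = delegate.docFreq(field, text);
            cache.put(key, cached);
        }
        return cached;
    }
}
```

With a wrapper like this, the second lookup made by the scorer would be
served from the map.  The cache would have to be discarded whenever the
underlying index changes, which this sketch does not handle.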

As you can see, these are rather involved changes, not to be done 
lightly, but I think they would also fix some longstanding bugs.  In the 
short term, the simple approach might be to only operate correctly when 
an IndexSearcher is used, and not when a MultiSearcher is used.  Sigh. 
Longer term, I will add revising these APIs to my queue of tasks.

Doug



