lucene-dev mailing list archives

From Wolf Siberski <siber...@l3s.de>
Subject Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Fri, 18 Feb 2005 12:31:31 GMT
Doug Cutting wrote:
> Christoph Goller wrote:
> 
>> The similarity specified for the search has to be modified so that both
>> idf(...) AND  queryNorm(...) always return 1 and as you say everything
>> except for tf(term,doc)*docNorm(doc) could be precompiled into the boosts
>> of the rewritten query. coord/tf/sloppyFreq computation would be done
>> locally by the Searchables as specified for this search.
>>
>> So the changes for the MultiSearcher bug would remain locally in 
>> MultiSearcher.
>> I think this would be a very clean solution. What do others think?
> 
> This sounds good to me!

At first it sounded good to me, too. However, as I said in my previous mail,
it would duplicate a significant part of the weight code in MultiSearcher.

I have now found another solution, which requires more changes but IMHO is
much cleaner:
- when a query computes its Weight, it caches it in an attribute
- a query can be 'frozen'. A frozen query always returns the cached
   Weight when Query.weight() is called.
- The MultiSearcher query processing is done in the following steps:
    1. rewrite query
    2. extract necessary terms
    3. collect dfs for these terms from the Searchables
    4. create query weights using aggregate dfs and *freeze query*.
    5. distribute weighted and frozen query to Searchables.
    6. merge results
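
Steps 3 to 5 can be sketched roughly as follows. This is only an illustration, not the code from the patch: DfProvider, FixedDfProvider, AggregateDfProvider and FreezableTermQuery are hypothetical names, the query is reduced to a single term, and its "weight" is just an idf in the style of DefaultSimilarity.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the docFreq/maxDoc part of a Searchable. */
interface DfProvider {
    int docFreq(String term);
    int maxDoc();
}

/** One sub-index with fixed statistics (illustration only). */
class FixedDfProvider implements DfProvider {
    private final Map<String, Integer> dfs = new HashMap<>();
    private final int maxDoc;

    FixedDfProvider(String term, int df, int maxDoc) {
        this.dfs.put(term, df);
        this.maxDoc = maxDoc;
    }
    public int docFreq(String term) { return dfs.getOrDefault(term, 0); }
    public int maxDoc() { return maxDoc; }
}

/** Step 3: sums docFreq and maxDoc over all sub-indexes. */
class AggregateDfProvider implements DfProvider {
    private final List<DfProvider> parts;

    AggregateDfProvider(DfProvider... parts) { this.parts = Arrays.asList(parts); }

    public int docFreq(String term) {
        int sum = 0;
        for (DfProvider p : parts) sum += p.docFreq(term);
        return sum;
    }
    public int maxDoc() {
        int sum = 0;
        for (DfProvider p : parts) sum += p.maxDoc();
        return sum;
    }
}

/** Toy single-term query that caches its weight and can be frozen. */
class FreezableTermQuery {
    private final String term;
    private float cachedWeight;
    private boolean hasWeight;
    private boolean frozen;

    FreezableTermQuery(String term) { this.term = term; }

    /** Computes and caches the weight, or returns the cached one if frozen. */
    float weight(DfProvider source) {
        if (!frozen) {
            cachedWeight = idf(source.docFreq(term), source.maxDoc());
            hasWeight = true;
        }
        return cachedWeight;
    }

    void freeze() {
        if (!hasWeight) throw new IllegalStateException("no weight computed yet");
        frozen = true;
    }

    /** Simplified idf; an assumption made for this sketch. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
}

class FrozenQuerySketch {
    public static void main(String[] args) {
        DfProvider a = new FixedDfProvider("lucene", 1, 10);
        DfProvider b = new FixedDfProvider("lucene", 9, 10);
        FreezableTermQuery q = new FreezableTermQuery("lucene");
        float global = q.weight(new AggregateDfProvider(a, b)); // step 4
        q.freeze();
        // Step 5: each sub-searcher now gets the globally computed weight.
        System.out.println(global == q.weight(a) && global == q.weight(b));
    }
}
```

Without the freeze() call, each sub-index would recompute the weight from its own local df, which is the inconsistency described in the bug report.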

I've submitted a new complete patch which implements this approach.

This approach requires that weights can be serialized. Interestingly,
Weight already implements Serializable, but the current implementation
doesn't work for all weight classes. The reason is that some weights
hold a reference to a searcher which is of course not serializable.
We can't make it transient either, because this searcher is the source
of the Similarity needed by scorers.
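
The serialization problem can be reproduced outside Lucene with a small sketch. All class names here are hypothetical stand-ins, not Lucene classes: a weight that stores a non-serializable searcher cannot be written at all, and a weight with a transient searcher arrives on the other side without it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for a searcher; deliberately not Serializable. */
class SearcherStub {
}

/** Weight that stores the searcher directly. */
class SearcherHoldingWeight implements Serializable {
    private static final long serialVersionUID = 1L;
    SearcherStub searcher;
    SearcherHoldingWeight(SearcherStub searcher) { this.searcher = searcher; }
}

/** Same weight, but with the searcher marked transient. */
class TransientSearcherWeight implements Serializable {
    private static final long serialVersionUID = 1L;
    transient SearcherStub searcher;
    TransientSearcherWeight(SearcherStub searcher) { this.searcher = searcher; }
}

class SerializationProblemSketch {
    /** Returns false if Java serialization rejects the object graph. */
    static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    /** Serializes and deserializes the weight, as a remote call would. */
    static TransientSearcherWeight copyOf(TransientSearcherWeight w) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(w);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TransientSearcherWeight) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // The non-transient variant cannot be serialized.
        System.out.println(serializes(new SearcherHoldingWeight(new SearcherStub())));
        // The transient variant serializes, but the copy has lost its searcher.
        System.out.println(copyOf(new TransientSearcherWeight(new SearcherStub())).searcher == null);
    }
}
```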

On closer look it turned out that the searcher is used for only two
things: as the source of a Similarity, and as the source of docFreq and maxDoc.
docFreq and maxDoc are needed only to initialize the weights, not
by the scorers. So instead of providing the Searcher, I now provide
a Similarity and a DocFreqSource to the weights. Only the Similarity is
stored by the weights. As an (IMHO) positive side effect, Similarity no longer
depends on Searcher, which leads to a better split of responsibilities:
- Similarity only provides scoring formulas
- Searcher (resp. DocFreqSource) provides the raw data (tf/df/maxDoc)
This change affects quite a few classes (because the createWeight() signature
is changed), but the modifications are pretty straightforward.
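
A rough sketch of this split, under stated assumptions: the names DocFreqSourceSketch, SimilaritySketch and TermWeightSketch are hypothetical, the method set of the real DocFreqSource in the patch may differ, and the score is reduced to tf * idf. The point is only that the weight reads df and maxDoc once in its constructor and afterwards holds nothing but serializable state.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical raw-data source; it does not need to be Serializable. */
interface DocFreqSourceSketch {
    int docFreq(String term);
    int maxDoc();
}

/** Scoring formulas only, with no reference to a searcher. */
class SimilaritySketch implements Serializable {
    private static final long serialVersionUID = 1L;

    float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
    float tf(float freq) {
        return (float) Math.sqrt(freq);
    }
}

/** Weight that stores the Similarity but only reads from the source. */
class TermWeightSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    private final SimilaritySketch similarity;
    private final float idf;

    TermWeightSketch(String term, SimilaritySketch similarity, DocFreqSourceSketch source) {
        this.similarity = similarity;
        // The source is used for initialization only and is not kept.
        this.idf = similarity.idf(source.docFreq(term), source.maxDoc());
    }

    /** Simplified score for a term frequency; norms and boosts are left out. */
    float score(float freq) {
        return similarity.tf(freq) * idf;
    }
}

class WeightSplitSketch {
    /** Fixed statistics for a single term (illustration only). */
    static DocFreqSourceSketch fixed(final int df, final int maxDoc) {
        return new DocFreqSourceSketch() {
            public int docFreq(String term) { return df; }
            public int maxDoc() { return maxDoc; }
        };
    }

    /** Serializes and deserializes the weight, as a remote call would. */
    static TermWeightSketch roundTrip(TermWeightSketch w) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(w);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TermWeightSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        TermWeightSketch w =
            new TermWeightSketch("lucene", new SimilaritySketch(), fixed(10, 20));
        TermWeightSketch copy = roundTrip(w);
        // The copy scores identically without access to any df source.
        System.out.println(copy.score(4f) == w.score(4f));
    }
}
```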

From my point of view, the patch submitted now is a sound solution
for Bug 31841 (at least I like it :-) ).
The next thing which IMHO needs to be done is a review by someone else.
As always, all comments are appreciated.

--Wolf







