lucene-dev mailing list archives

From Wolf Siberski <siber...@l3s.de>
Subject Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Thu, 10 Feb 2005 15:37:55 GMT
Christoph Goller wrote:
 > Chuck Williams wrote:
 >> score(query, doc) =
 >>   coord*queryNorm*
 >>     sum[ term in query :
 >>       idf(term)*boost(term)*idf(term)*tf(term, doc)*docNorm(doc)
 >>        ]
 >>
 >> where queryNorm = 1/sum[ term in query : (boost(term)*idf(term))^2 ]
 >> [...] The MultiSearcher boost could
 >> be all terms in the formula above except for tf(term,doc)*docNorm(doc).
 >
 >
 > Great. You are right Chuck.
 > The similarity specified for the search has to be modified so that both
 > idf(...) AND  queryNorm(...) always return 1 and as you say everything
 > except for tf(term,doc)*docNorm(doc) could be precompiled into the boosts
 > of the rewritten query. coord/tf/sloppyFreq computation would be done
 > locally by the Searchables as specified for this search.
 >
 > So the changes for the MultiSearcher bug would remain locally in MultiSearcher.
 > I think this would be a very clean solution. What do others think?

I have just added a new version of the patch to Bugzilla which
goes in this direction. Everything except the boost readjustment
is implemented now (there is a temporary replacement to make the
patch work anyway).

There are two reasons why I haven't yet implemented the boost factor
adjustment as proposed by Christoph:
1. I'm still in the process of acquiring a detailed understanding
    of how and where all the weighting is happening.
2. (This is the more important point.) While at first I considered it
    a good idea to use the boost as a correction factor, I'm no longer
    so sure. Soon after starting the implementation I realized
    that I was essentially repeating the weight/scorer preparation
    process outside of the query.
    In other words, I was duplicating program logic. Thus, whenever the
    Lucene query evaluation process changes in the future, the
    MultiSearcher will have to be changed along with it, and this smells
    bad (as the XP-ers would say).

Now, here is my suggestion for what to do instead: if we can precalculate
this factor before evaluating each single document, why don't we do
that in all cases? I'm imagining something like a second rewrite step
which prepares the weights as outlined by Chuck and Christoph,
and which runs before any scoring starts. In the non-distributed case
this step would just be executed before creating the scorers;
in the distributed case it would be executed by the MultiSearcher,
and then the prepared query would be distributed to all searchables.
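
To make the idea concrete, here is a minimal standalone sketch of such a
preparation step. It does not use any Lucene classes: TermStat,
prepareWeights() and scoreLocal() are hypothetical names invented for this
illustration, and the idf values are assumed to have been computed already
from collection-wide document frequencies. It only shows that everything in
the quoted formula except coord and tf(term,doc)*docNorm(doc) can be computed
once per query and then reused for every document.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PreparedWeightsSketch {

    /** Hypothetical per-term input: query boost plus a globally computed idf. */
    static final class TermStat {
        final String term;
        final double boost;
        final double idf;

        TermStat(String term, double boost, double idf) {
            this.term = term;
            this.boost = boost;
            this.idf = idf;
        }
    }

    /**
     * queryNorm exactly as written in the quoted formula:
     * 1 / sum((boost*idf)^2). Note that Lucene's DefaultSimilarity
     * takes the square root of this sum.
     */
    static double queryNorm(List<TermStat> terms) {
        double sum = 0.0;
        for (TermStat t : terms) {
            double w = t.boost * t.idf;
            sum += w * w;
        }
        return 1.0 / sum;
    }

    /** Preparation step: document-independent weight per term, computed once per query. */
    static Map<String, Double> prepareWeights(List<TermStat> terms) {
        double norm = queryNorm(terms);
        Map<String, Double> prepared = new LinkedHashMap<>();
        for (TermStat t : terms) {
            prepared.put(t.term, norm * t.idf * t.boost * t.idf);
        }
        return prepared;
    }

    /** Local step: needs only the prepared weights, coord, and tf*docNorm of one document. */
    static double scoreLocal(Map<String, Double> prepared,
                             Map<String, Double> tfTimesDocNorm,
                             double coord) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : prepared.entrySet()) {
            Double local = tfTimesDocNorm.get(e.getKey());
            if (local != null) {
                sum += e.getValue() * local;
            }
        }
        return coord * sum;
    }

    /** Reference: the quoted formula evaluated directly, without a preparation step. */
    static double scoreDirect(List<TermStat> terms,
                              Map<String, Double> tfTimesDocNorm,
                              double coord) {
        double sum = 0.0;
        for (TermStat t : terms) {
            Double local = tfTimesDocNorm.get(t.term);
            if (local != null) {
                sum += t.idf * t.boost * t.idf * local;
            }
        }
        return coord * queryNorm(terms) * sum;
    }

    public static void main(String[] args) {
        List<TermStat> terms = Arrays.asList(
                new TermStat("lucene", 1.0, 2.0),
                new TermStat("search", 2.0, 1.5));
        Map<String, Double> doc = new HashMap<>();
        doc.put("lucene", 0.5);   // tf*docNorm for this document (made-up values)
        doc.put("search", 0.25);

        Map<String, Double> prepared = prepareWeights(terms);
        System.out.println("prepared: " + scoreLocal(prepared, doc, 1.0));
        System.out.println("direct:   " + scoreDirect(terms, doc, 1.0));
    }
}
```

In the distributed case only the map returned by prepareWeights() would have
to travel to the searchables; each of them would then run the equivalent of
scoreLocal() against its own index. Both paths give the same score up to
floating-point rounding.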

Would this be a reasonable approach? If so, would someone more
familiar with query/weight internals be willing to implement it?
I could try to do it, but it seems that this task really needs
to touch the Lucene 'kernel', and I'm still rather a newbie in this area.

If someone wants to take a look at the patch, the best place to start
is MultiSearcher.prepareQueries(). I'd appreciate any comments
regarding the patch. For example, I'm not too happy with the introduction
of the Query.addTerms() method, but I don't know how else to get
the required information.

--Wolf
