lucene-dev mailing list archives

From Otis Gospodnetic <>
Subject Re: Whither Query Norm?
Date Wed, 25 Nov 2009 05:18:53 GMT
I'm late to the thread, and although it looks like the discussion is over, I'll inline a Q
for Jake.

>I should add in my $0.02 on whether to just get rid of queryNorm() altogether: 
>>>  -1 from me, even though it's confusing, because having that call there (somewhere,
at least) allows you to actually compare scores across queries if you do the extra work
of properly normalizing the documents as well (at index time).
>>Do you have some references on this?  I'm interested in reading more on the subject.
 I've never quite been sold on how it is meaningful to compare scores and would like to read
more opinions.
>References on how people do this *with Lucene*, or just how this is done in general? 
There are lots of papers on fancy things which can be done, but I'm not sure where to point
you to start out.  The technique I'm referring to is really just the simplest possible thing
beyond setting your weights "by hand": let's assume you have a boolean OR query, Q, built
up out of sub-queries q_i (hitting, for starters, different fields, although you can overlap
as well with some more work), each with a set of weights (boosts) b_i, then if you have a
training corpus (good matches, bad matches, or ranked lists of matches in order of relevance
for the queries at hand), *and* scores (at the q_i level) which are comparable,

You mentioned this about 3 times in this thread (contrib/queries wants you!)
It sounds like you've done this before (with Lucene?).  But how, if the scores are not comparable,
and that's required for this "field boost learning/training" to work?


> then you can do a simple regression (linear or logistic, depending on whether you map
your final scores to a logit or not) on the b_i to fit for the best boosts to use.  What is
critical here is that scores from different queries are comparable.  If they're not, then
queries where the best document for a query scores 2.0 overly affect the training in comparison
to the queries where the best possible score is 0.5 (actually, wait, it's the reverse: you're
training to increase scores of matching documents, so the system tries to make that 0.5 scoring
document score much higher by raising boosts higher and higher, while the good matches already
scoring 2.0 don't need any more boosting, if that makes sense).
>There are of course far more complex "state of the art" training techniques, but probably
someone like Ted would be able to give a better list of references on the easiest places to
read about those.  But I can try to dredge up some places where I've read about doing this,
and post again later if I can find any.
>  -jake
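
For readers following along, here is a minimal sketch of the boost-fitting regression Jake describes above, using the linear variant. It is not code from Lucene or from Jake. The function names (fit_boosts, combined_score), the two fields, and the training numbers are all made up for illustration, and it assumes the per-sub-query scores have already been normalized so that they are comparable across queries.

```python
# Hypothetical sketch: fit per-sub-query boosts b_i by least squares.
# Each row holds the (assumed comparable) sub-query scores for one
# (query, document) pair; each label is a relevance judgment in [0, 1].


def fit_boosts(rows, labels, lr=0.1, epochs=5000):
    """Batch gradient descent on mean squared error; returns the boosts."""
    n_features = len(rows[0])
    boosts = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for scores, label in zip(rows, labels):
            pred = sum(b * s for b, s in zip(boosts, scores))
            err = pred - label
            for i, s in enumerate(scores):
                grad[i] += err * s
        # Average the gradient over the training set before stepping.
        boosts = [b - lr * g / len(rows) for b, g in zip(boosts, grad)]
    return boosts


def combined_score(boosts, scores):
    """Score of the boolean OR query Q: sum of boosted sub-query scores."""
    return sum(b * s for b, s in zip(boosts, scores))


# Made-up training data: [title_score, body_score] per (query, doc) pair.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
labels = [0.8, 0.2, 1.0, 0.5]

boosts = fit_boosts(rows, labels)
print("learned boosts:", boosts)
print("score for [0.9, 0.1]:", combined_score(boosts, [0.9, 0.1]))
```

If the rows came from queries whose scores live on different scales (one query topping out at 2.0, another at 0.5), the fit would chase the low-scoring queries by inflating boosts, which is the failure mode described in the quoted text.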
