lucene-dev mailing list archives

From Tavi Nathanson <tavi.nathan...@gmail.com>
Subject Scoring: Precedent for a Better, Less Fragile Approach?
Date Mon, 07 Feb 2011 17:29:16 GMT

Hey everyone,

I have a question about Lucene/Solr scoring in general. It really feels like
a wobbly house of cards that falls down whenever I make the slightest tweak.
There are many factors at play in Lucene scoring: they're all fighting with
each other, and very often one will completely dominate everything else,
when that may not really be the intention.

** The question: might there be a way to enforce strict requirements that
certain factors are higher priority than other factors, and/or certain
factors shouldn't overtake other factors? Perhaps a set of rules where one
factor is considered before even examining another factor? Tuning boost
numbers around and hoping for the best seems imprecise and very fragile. **
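One way to picture "a set of rules where one factor is considered before even examining another" is a lexicographic sort: each hit gets a tuple of factors, and a later factor only matters when every earlier one ties. The sketch below is plain Python, not Lucene or Solr API. The `Hit` type, its fields, and the ordering of the tiers are all hypothetical and chosen purely for illustration.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical search hit; none of these names come from Lucene."""
    doc_id: str
    terms_matched: int    # unique query terms matched (tier 1)
    fields_matched: int   # distinct fields with a match (tier 2)
    raw_score: float      # conventional relevance score (tier 3, tie-breaker)


def tiered_key(hit: Hit) -> tuple:
    # Tuples compare element by element, so raw_score can never
    # overtake a difference in terms_matched or fields_matched.
    return (hit.terms_matched, hit.fields_matched, hit.raw_score)


def rank(hits: List[Hit]) -> List[Hit]:
    # Highest tier values first.
    return sorted(hits, key=tiered_key, reverse=True)


hits = [
    Hit("title_only", terms_matched=1, fields_matched=1, raw_score=0.71),
    Hit("title_and_body", terms_matched=1, fields_matched=2, raw_score=0.50),
]
for hit in rank(hits):
    print(hit.doc_id, tiered_key(hit))
```

With this ordering the document matching in two fields ranks first even though its raw score is lower, which is the kind of guarantee that boost tuning alone does not give.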

To make this more concrete, an example:

We previously added the scores of multi-field matches together via an OR,
so: score(query "apple") = score(field1:apple) + score(field2:apple). I
changed that to be more in line with DisMaxParser, namely a max: score(query
"apple") = max(score(field1:apple), score(field2:apple)). I also modified
coord such that coord would only consider actual unique terms ("apple" vs.
"orange"), rather than terms across multiple fields (field1:apple vs.
field2:apple).
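The two combination rules described above can be compared in a toy model. This is a simplified sketch in plain Python, not the actual Lucene or DisMax implementation: the per-clause scores are made-up numbers, the function names are hypothetical, and the coord here is just "matched / total" under each counting scheme.

```python
def old_score(clause_scores, terms, fields):
    """Sum every matching field:term clause; coord counts field:term clauses."""
    matched = [s for s in clause_scores.values() if s > 0]
    coord = len(matched) / (len(terms) * len(fields))
    return coord * sum(matched)


def new_score(clause_scores, terms, fields):
    """Take the max over fields for each term; coord counts unique terms."""
    per_term = [
        max(clause_scores.get((field, term), 0.0) for field in fields)
        for term in terms
    ]
    matched_terms = [s for s in per_term if s > 0]
    coord = len(matched_terms) / len(terms)
    return coord * sum(matched_terms)


terms = ["apple"]
fields = ["title", "body"]

# Made-up clause scores, keyed by (field, term).
title_and_body = {("title", "apple"): 1.0, ("body", "apple"): 0.5}
title_only = {("title", "apple"): 1.0}

print("old:", old_score(title_and_body, terms, fields),
      old_score(title_only, terms, fields))
print("new:", new_score(title_and_body, terms, fields),
      new_score(title_only, terms, fields))
```

Under the old rule the two-field document is well ahead. Under the new rule the two documents tie in this model, so any remaining factor decides the order.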

This seemed like a good idea, but it actually introduced a bug that was
previously hidden. Suddenly, documents matching "apple" in the title and
*nothing* in the body were being boosted over "apple" in the title and
"apple" in the body! I investigated, and it was due to lengthNorm:
previously, documents matching "apple" in both title and body were getting
higher scores thanks to summing the field scores (vs. max) as well as a
higher coord score. Now that they were no longer getting these boosts, which
was beneficial in many respects, the playing field was leveled. And this
leveling of the playing field allowed lengthNorm to dominate everything
else.
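A rough numeric sketch of that effect, assuming the classic default length norm of 1/sqrt(number of terms in the field) and holding tf, idf, and boosts at 1. The field lengths and function names below are invented for illustration, and the code is plain Python rather than anything from Lucene.

```python
import math


def length_norm(num_terms):
    # Assumed norm: shorter fields score higher.
    return 1.0 / math.sqrt(num_terms)


def summed_score(field_lengths, matched_fields):
    """Old behavior: sum the matching fields and apply a per-field coord."""
    coord = len(matched_fields) / len(field_lengths)
    return coord * sum(length_norm(field_lengths[f]) for f in matched_fields)


def max_score(field_lengths, matched_fields):
    """New behavior: only the best-scoring field counts."""
    return max(length_norm(field_lengths[f]) for f in matched_fields)


# Invented documents: (field lengths in terms, fields that match "apple").
short_title_only = ({"title": 2, "body": 100}, ["title"])
long_title_and_body = ({"title": 4, "body": 100}, ["title", "body"])

for name, scorer in [("sum + coord", summed_score), ("max", max_score)]:
    a = scorer(*short_title_only)
    b = scorer(*long_title_and_body)
    winner = "title only" if a > b else "title and body"
    print(f"{name}: title_only={a:.3f} title_and_body={b:.3f} -> {winner}")
```

In this model the summed score favors the document matching both fields, while the max score lets the shorter title win on its length norm alone.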

Any help would be much appreciated. Thanks!

Tavi
