lucene-dev mailing list archives

From "Chuck Williams" <>
Subject Normalized Scoring -- was RE: idf and explain(), was Re: Search and Scoring
Date Wed, 20 Oct 2004 05:56:56 GMT
Hi everybody,

Although there doesn't seem to be much interest in this, I have one
significant improvement to the proposal below and a couple of
observations that clarify the situation.

To illustrate the problem better normalization is intended to address,
in my current application for BooleanQuery's of two terms, I frequently
get a top score of 1.0 when only one of the terms is matched.  1.0
should indicate a "perfect match".  I'd like to set my UI up to present the
results differently depending on how good the different results are
(e.g., showing a visual indication of result quality, dropping the
really bad results entirely, and segregating the good results from other
only vaguely relevant results).  To build this kind of "intelligence"
into the UI, I certainly need to know whether my top result matched all
query terms, or only half of them.  I'd like to have the score tell me
directly how good the matches are.  The current normalization does not
achieve this.

The intent of a new normalization scheme is to preserve the current
scoring in the following sense:  the ratio of the nth result's score to
the best result's score remains the same.  Therefore, the only question
is what normalization factor to apply to all scores.  This reduces to a
very specific question that determines the entire normalization -- what
should the score of the best matching result be?
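The ratio-preserving idea can be sketched with plain numbers (the scores
here are made up; nothing below is Lucene API):

```python
def normalize(raw_scores, top_score):
    """Rescale raw scores so the best hit receives top_score while the
    ratio between any two hits' scores is preserved."""
    factor = top_score / max(raw_scores)
    return [s * factor for s in raw_scores]

# If the best raw score is 4.0 and we decide its normalized score
# should be 0.75, every hit is scaled by the same factor 0.75/4.0:
print(normalize([4.0, 2.0, 1.0], top_score=0.75))  # [0.75, 0.375, 0.1875]
```

Note that the second hit still scores exactly half the first, as it did
before normalization; only the absolute scale has changed.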

The mechanism below has this property, i.e. it keeps the current score
ratios, except that I removed one idf term for reasons covered earlier
(this better reflects the current empirically best scoring algorithms).
However, removing an idf is a completely separate issue.  The improved
normalization is independent of whether or not that change is made.

For the central question of what the top score should be, upon
reflection, I don't like the definition below.  It defined the top score
as (approximately) the percentage of query terms matched by the top
scoring result.  A better conceptual definition is to use a weighted
average based on the boosts.  I.e., downward propagate all boosts to the
underlying terms (or phrases).  Secifically, the "net boost" of a term
is the product of the direct boost of the term and all boosts applied to
encompassing clauses.  Then the score of the top result becomes the sum
of the net boosts of its matching terms divided by the sum of the net
boosts of all query terms.

This definition is a refinement of the original proposal below, and it
could probably be further refined if some impact of the tf, idf and/or
lengthNorm was desired in determining the top score.  These other
factors seem to be harder to normalize for, although I've thought of
some simple approaches; e.g., assume the unmatched terms in the top
result have values for these three factors that are the average of the
values of the matched terms, then apply exactly the same concept of
actual score divided by theoretical maximum score.  That would
eliminate any need to maintain maximum value statistics in the index.
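A toy illustration of that averaging idea (all numbers below are
hypothetical term contributions, not values Lucene computes):

```python
def estimated_top_score(matched_contribs, num_unmatched):
    """Estimate the top hit's normalized score: assume each unmatched
    query term would have contributed the average of the matched terms'
    contributions (tf * idf * lengthNorm), then divide the actual score
    by that theoretical maximum."""
    actual = sum(matched_contribs)
    avg = actual / len(matched_contribs)
    theoretical_max = actual + avg * num_unmatched
    return actual / theoretical_max

# Two matched terms contributing 0.6 and 0.2, one unmatched term:
# the average contribution is 0.4, the theoretical max is 0.8 + 0.4,
# and the estimated top score is 0.8 / 1.2, roughly 0.667.
print(estimated_top_score([0.6, 0.2], num_unmatched=1))
```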

As an example of the simple boost-based normalization, for the query
  ((a^2 b)^3 (c d^2))
the net boosts are:
  a --> 6
  b --> 3
  c --> 1
  d --> 2

So if a and b matched, but not c and d, in the top scoring result, its
score would be 0.75.  The normalizer would be 0.75 divided by the top
hit's current score (computed without the current normalization), and it
would be applied to every hit's unnormalized score to produce the
normalized scores.
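The example above can be checked with a short sketch (the query
structure is hard-coded as nested tuples; no Lucene API is involved):

```python
def net_boosts(query, outer=1.0):
    """Propagate boosts down a nested query.  A query is either
    ('term', name, boost) or ('bool', boost, clauses)."""
    if query[0] == 'term':
        return {query[1]: query[2] * outer}
    boosts = {}
    for clause in query[2]:
        boosts.update(net_boosts(clause, outer * query[1]))
    return boosts

def top_score(boosts, matched):
    """Sum of net boosts of matched terms over sum of all net boosts."""
    return sum(boosts[t] for t in matched) / sum(boosts.values())

# ((a^2 b)^3 (c d^2)) from the example above:
query = ('bool', 1.0, [('bool', 3.0, [('term', 'a', 2.0), ('term', 'b', 1.0)]),
                       ('bool', 1.0, [('term', 'c', 1.0), ('term', 'd', 2.0)])])
boosts = net_boosts(query)            # {'a': 6.0, 'b': 3.0, 'c': 1.0, 'd': 2.0}
print(top_score(boosts, ['a', 'b']))  # (6 + 3) / 12 = 0.75
```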

For simple query (a b), if only one of the terms matched in the top
result, then its score would be 0.5, vs. 1.0 or many other possible
scores today.

In addition to enabling more "intelligent" UI's that communicate the
quality of results to end-users, the proposal below also extends the
explain() mechanism to fully explain the final normalized score.
However, that change is also independent -- it could be done with the
current scoring.

Am I the only one who would like to see better normalization in Lucene?
Does anybody have a better approach?

If you've read this far, thanks for indulging me on this.  I would love
to see this or something with similar properties in Lucene.  I really
just want to build my app, but as stated below would write and
contribute this if there is interest in putting it in, and nobody else
wants to write it.  Please let me know what you think one way or the
other.


  > -----Original Message-----
  > From: Chuck Williams
  > Sent: Monday, October 18, 2004 7:04 PM
  > To: 'Lucene Developers List'
  > Subject: RE: idf and explain(), was Re: Search and Scoring
  > Doug Cutting wrote:
  >   > If this is a big issue for you, as it seems it is, please
  >   > submit a patch to optionally disable score normalization in Hits.
  > and:
  >   > The quantity 'sum(t) weight(t,d)^2' must be recomputed for each
  >   > document each time a document is added to the collection, since it
  >   > is dependent on global term statistics.  This is prohibitively
  >   > expensive.  Research has also demonstrated that such cosine
  >   > normalization yields somewhat inferior results (e.g., Singhal's
  >   > pivoted length normalization).
  > I'm willing to write, test and contribute code to address the
  > normalization issue, i.e. to yield scores in [0, 1] that are
  > comparable across searches.  Unfortunately, this is considerably more
  > involved than just optionally eliminating the current normalization in
  > Hits.  Before undertaking this, I'd like to see if there is agreement
  > in principle that this is a good idea, and that my specific proposal
  > below is the right way to go.  Also, I'd like to make sure I've
  > correctly understood the constraints on writing code to be
  > incorporated into Lucene.
  > After looking at this in more detail I agree that the cosine
  > normalization is not the way to go, because of both efficiency and
  > effectiveness considerations.  A conceptual approach that would be
  > efficient, relatively easy to implement, and seems to have at least
  > reasonable behavior would be to define the top scoring match to have
  > a score roughly equal to the percentage of query terms it matches (its
  > "netCoord").  Scores below the top hit would be reduced based on the
  > ratio of their raw scores to the raw score of the top hit (keeping
  > all of the current score factors, except that I'd like to remove one
  > of the idf factors, as discussed earlier).
  > For a couple simple cases:
  >   a) the top match for a single term query would always have a score
  > of 1.0,
  >   b) the top scoring match for a BooleanQuery with all non-prohibited
  > TermQuery clauses would have a score of m/n, where the hit matches m
  > of the n terms.
  > This isn't optimal, but seems much better than the current behavior.
  > Consider two single-term queries, s and t.  If s matches more strongly
  > than t in its top hit (e.g., a higher tf in a shorter field), it would
  > be best if the top score of s was greater than the top score of t.
  > But this kind of normalization would require keeping additional
  > statistics that so far as I know are not currently in the index, like
  > the maximum tf for terms and the minimum length for fields.  These
  > would be expensive to update on deletes.  Also, normalizing by such
  > statistics could yield lower than subjectively reasonable scores in
  > most cases, so it's not clear it would be better.
  > The semantics above are at least clean, easy to understand, and
  > serve what seems to me is the most important motivation to do this:
  > enabling an application to use simple thresholding to segregate
  > likely-to-be-relevant hits from likely-to-be-irrelevant hits.
  > More specifically, for a BooleanQuery of TermQuery's the resulting
  > scoring functions would be:
  > BooleanQuery of TermQuery's sbq = (tq1 ... tqn)
  > baseScore(sbq, doc) =
  >   sum(tqi) boost(tqi)*idf(tqi.term)*tf(tqi.term, doc)*
  >            lengthNorm(tqi.term.field, doc)
  > rawScore(sbq, doc) = coord(sbq, doc) * baseScore
  > norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
  > score(sbq, doc) = rawScore * norm
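To make the four formulas concrete, here is a toy version for a
BooleanQuery of TermQuery's (the tf, idf, and lengthNorm values below
are invented constants, not what Lucene would actually compute):

```python
def base_score(terms, doc):
    """sum(tqi) boost * idf * tf * lengthNorm; unmatched terms add 0."""
    return sum(t['boost'] * t['idf'] * doc['tf'].get(t['term'], 0.0)
               * doc['lengthNorm'] for t in terms)

def coord(terms, doc):
    """Fraction of query terms that the document matches."""
    return sum(1 for t in terms if t['term'] in doc['tf']) / len(terms)

def final_scores(terms, docs):
    """score = rawScore * norm, with norm = 1 / max baseScore over hits."""
    raw = [coord(terms, d) * base_score(terms, d) for d in docs]
    norm = 1.0 / max(base_score(terms, d) for d in docs)
    return [r * norm for r in raw]

terms = [{'term': 'a', 'boost': 1.0, 'idf': 1.0},
         {'term': 'b', 'boost': 1.0, 'idf': 1.0}]
docs = [{'tf': {'a': 1.0, 'b': 1.0}, 'lengthNorm': 1.0},  # matches both terms
        {'tf': {'a': 1.0}, 'lengthNorm': 1.0}]            # matches only 'a'
print(final_scores(terms, docs))  # [1.0, 0.25]
```

The hit matching both terms of (a b) ends up at exactly 1.0 and the
one-term hit at 0.25: its coord of 0.5 times a baseScore half the
maximum.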
  > rawScore's can be computed in the Scorer.score() methods and then
  > used to sort the hits (e.g., in the collect() method of the
  > HitCollector).  If the top scoring hit does not have the highest
  > baseScore, then its score could be less than its coord; this seems
  > desirable.  These formulas imply that no result's score can be larger
  > than its coord, so if coord is well-defined (always between 0 and 1)
  > then all results will be normalized between 0 and 1.
  > In general, the netCoord, which takes the place of coord in the
  > simple case above, needs to be defined for any query.  Conceptually,
  > this should be the total percentage of query terms matched by the
  > document.  It must be recursively computable from the subquery
  > structure and overridable for specific Query types (e.g., to support
  > specialized coords, like one that is always 1.0 as is useful in
  > multi-field searching).  Suitable default definitions for TermQuery
  > and BooleanQuery are:
  > TermQuery.netCoord = 1.0 if term matches, 0.0 otherwise
  > BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
  > This is not quite percentage of terms matched; e.g., consider a
  > BooleanQuery with two clauses, one of which is a BooleanQuery of 3
  > terms and the other which is just a term.  However, it doesn't seem to
  > be unreasonable for a BooleanQuery to state that its clauses are
  > equally important, and this is consistent with the current coord
  > behavior.  BooleanQuery.netCoord could be overridden for special
  > cases, like the pure disjunction I use in my app for field expansions:
  > MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord
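The three definitions can be sketched as one recursive function (using
hypothetical tuples rather than the actual Query classes, and taking
coord(1, n) = 1/n):

```python
def net_coord(query, matched):
    """netCoord over a query tree: a term scores 1.0 or 0.0, a
    BooleanQuery averages its clauses' netCoords, and a
    MaxDisjunctionQuery takes its best clause's netCoord."""
    kind = query[0]
    if kind == 'term':
        return 1.0 if query[1] in matched else 0.0
    parts = [net_coord(c, matched) for c in query[1]]
    return max(parts) if kind == 'maxdis' else sum(parts) / len(parts)

# The two-clause example from the text: one clause of 3 terms, one term.
q = ('bool', [('bool', [('term', 'a'), ('term', 'b'), ('term', 'c')]),
              ('term', 'd')])
print(net_coord(q, {'a', 'b'}))  # (2/3 + 0) / 2 = 1/3, not 2/4
```

The printed value shows the "not quite percentage of terms matched"
point: two of four terms match, but the score is 1/3 rather than 1/2
because each top-level clause counts equally.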
  > Implementing this would proceed along these lines:
  >   1.  For backwards compatibility, add some kind of newScoring
  > setting.
  >   2.  Update all of these places to behave as indicated if the
  > newScoring setting is on:
  >     a.  Change Query.weight() to not do any normalization (no call to
  > sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
  >     b.  Update all Query.weight classes to set their value according to
  > the terms in the score formula above that don't involve the document
  > (e.g., boost*idf for TermQuery).
  >     c.  Add suitable netCoord definitions to all Scorer classes.
  >     d.  Update all Scorer.score() methods to compute the rawScore as
  > above.
  >     e.  Add the maximum baseScore as a field kept on TopDocs, computed
  > in the HitCollector's.
  >     f.  Change the normalization in Hits to always divide every raw
  > score by the maximum baseScore.
  >     g.  Update all of the current explain() methods to be consistent
  > with this scoring, and to either report the rawScore, or to report the
  > final score if the normalization factor is provided.
  >     h.  Add Hits.explain() (or better, extend Searcher so that it keeps
  > the Hits for use in Searcher.explain()) to call the new explain
  > variation with the normalization factor so that final scores are
  > explained.
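Steps (e) and (f) amount to tracking the maximum baseScore while hits
are collected and then dividing every raw score by it; a minimal sketch
(the real change would live in the HitCollector and Hits classes, and
the scores below are invented):

```python
class NormalizingCollector:
    """Collects (doc, rawScore) pairs and remembers the maximum
    baseScore seen, so normalization is a single pass at the end."""
    def __init__(self):
        self.hits = []
        self.max_base = 0.0

    def collect(self, doc_id, raw_score, base_score):
        self.hits.append((doc_id, raw_score))
        self.max_base = max(self.max_base, base_score)

    def normalized(self):
        return [(doc, raw / self.max_base) for doc, raw in self.hits]

c = NormalizingCollector()
c.collect(1, raw_score=2.0, base_score=2.0)  # matched both query terms
c.collect(2, raw_score=0.5, base_score=1.0)  # matched only one term
print(c.normalized())  # [(1, 1.0), (2, 0.25)]
```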
  > If this seems like a good idea, please let me know.  I'm sure there
  > are details I've missed that would come out during coding and testing.
  > Also, the value of this is dependent on how reasonable the final
  > scores are, which is hard to tell for sure until it is working.
  > The coding standards for Lucene seem reasonably clear from the
  > existing code I've read.  I could use just simple Java, so there
  > shouldn't be any significant JVM dependencies.  The above should be
  > fully backward compatible due to the newScoring flag.
  > On another note, I had to remove the German analyzer in my current
  > source configuration because GermanStemmer failed to compile due to
  > what are apparently Unicode character constants that I've now got as
  > two-character character constants.  This is presumably an encoding
  > problem somewhere that I could track down.  It's not important, but if
  > the answer is obvious to any of you, I'd appreciate the quick tip.
  > Thanks,
  > Chuck
  >   > -----Original Message-----
  >   > From: Doug Cutting []
  >   > Sent: Monday, October 18, 2004 9:44 AM
  >   > To: Lucene Developers List
  >   > Subject: Re: idf and explain(), was Re: Search and Scoring
  >   >
  >   > Chuck Williams wrote:
  >   > > That's a good point on how the standard vector space inner
  >   > > product similarity measure does imply that the idf is squared
  >   > > relative to the document tf.  Even having been aware of this
  >   > > formula for a long time, this particular implication never
  >   > > occurred to me.  Do you know whether anybody has done
  >   > > precision/recall or other relevancy empirical measurements
  >   > > comparing this vs. a model that does not square the idf?
  >   >
  >   > No, not that I know of.
  >   >
  >   > > Regarding normalization, the normalization in Hits does not
  >   > > have very nice properties.  Due to the > 1.0 threshold check, it
  >   > > loses information, and it arbitrarily defines the highest
  >   > > scoring hit in any list that generates scores above 1.0 as a
  >   > > perfect match.  It would be nice if score values were meaningful
  >   > > independent of the query, e.g., if "0.6" meant the same quality
  >   > > of retrieval independent of which search was done.  This would
  >   > > allow, for example, sites to use a quality threshold to only
  >   > > show results that were "good enough".  At my last company (I was
  >   > > President and head of engineering) we found this to be important
  >   > > to many customers.
  >   >
  >   > If this is a big issue for you, as it seems it is, please
  >   > submit a patch to optionally disable score normalization in Hits.
  >   >
  >   > > The standard vector space similarity measure includes
  >   > > normalization by the product of the norms of the vectors, i.e.:
  >   > >
  >   > >   score(d,q) =  sum over t of ( weight(t,q) * weight(t,d) ) /
  >   > >                 sqrt [ (sum(t) weight(t,q)^2) * (sum(t)
  >   > >                        weight(t,d)^2) ]
  >   > >
  >   > > This makes the score a cosine, which, since the values are all
  >   > > positive, forces it to be in [0, 1].  The sumOfSquares()
  >   > > normalization in Lucene does not fully implement this.  Is there
  >   > > a specific reason for that?
  >   >
  >   > The quantity 'sum(t) weight(t,d)^2' must be recomputed for each
  >   > document each time a document is added to the collection, since it
  >   > is dependent on global term statistics.  This is prohibitively
  >   > expensive.  Research has also demonstrated that such cosine
  >   > normalization yields somewhat inferior results (e.g., Singhal's
  >   > pivoted length normalization).
  >   >
  >   > > Re. explain(), I don't see a downside to extending it to show
  >   > > the final normalization in Hits.  It could still show the raw
  >   > > score just prior to that normalization.
  >   >
  >   > In order to normalize scores to 1.0 one must know the maximum
  >   > score.  Explain only computes the score for a single document, and
  >   > the maximum score is not known.
  >   >
  >   >  > Although I think it would be best to have a normalization
  >   >  > that would render scores comparable across searches.
  >   >
  >   > Then please submit a patch.  Lucene doesn't change on its own.
  >   >
  >   > Doug
