lucene-dev mailing list archives

From "Terry Steichen" <te...@net-frame.com>
Subject Re: Normalized Scoring -- was RE: idf and explain(), was Re: Search and Scoring
Date Thu, 21 Oct 2004 19:36:53 GMT
+1
  ----- Original Message ----- 
  From: Otis Gospodnetic 
  To: Lucene Developers List 
  Sent: Thursday, October 21, 2004 1:05 PM
  Subject: Re: Normalized Scoring -- was RE: idf and explain(), was Re: Search and Scoring


  Hi Chuck,

  The relative lack of responses is not because there is no interest, but
  probably because there are only a few people on lucene-dev who can
  follow/understand every detail of your proposal.  I understand and hear
  you, but I have a hard time 'visualizing' some of the formulas in your
  proposal.  What you are saying sounds right to me, but I don't have
  enough theoretical knowledge to go one way or the other.

  Otis


  --- Chuck Williams <chuck@manawiz.com> wrote:

  > Hi everybody,
  > 
  > Although there doesn't seem to be much interest in this I have one
  > significant improvement to the below and a couple observations that
  > clarify the situation.
  > 
  > To illustrate the problem better normalization is intended to
  > address: in my current application, for BooleanQuery's of two terms,
  > I frequently get a top score of 1.0 when only one of the terms is
  > matched.  1.0 should indicate a "perfect match".  I'd like to set my
  > UI up to present the results differently depending on how good they
  > are (e.g., showing a visual indication of result quality, dropping
  > the really bad results entirely, and segregating the good results
  > from other only vaguely relevant results).  To build this kind of
  > "intelligence" into the UI, I certainly need to know whether my top
  > result matched all query terms, or only half of them.  I'd like to
  > have the score tell me directly how good the matches are.  The
  > current normalization does not achieve this.
  > 
  > The intent of a new normalization scheme is to preserve the current
  > scoring in the following sense: the ratio of the nth result's score
  > to the best result's score remains the same.  Therefore, the only
  > question is what normalization factor to apply to all scores.  This
  > reduces to a very specific question that determines the entire
  > normalization -- what should the score of the best matching result
  > be?
  > 
  > The mechanism below has this property, i.e. it keeps the current
  > score ratios, except that I removed one idf term for reasons covered
  > earlier (this better reflects the current empirically best scoring
  > algorithms).  However, removing an idf is a completely separate
  > issue.  The improved normalization is independent of whether or not
  > that change is made.
  > 
  > For the central question of what the top score should be, upon
  > reflection, I don't like the definition below.  It defined the top
  > score as (approximately) the percentage of query terms matched by the
  > top scoring result.  A better conceptual definition is to use a
  > weighted average based on the boosts.  I.e., propagate all boosts
  > down to the underlying terms (or phrases).  Specifically, the "net
  > boost" of a term is the product of the direct boost of the term and
  > all boosts applied to encompassing clauses.  Then the score of the
  > top result becomes the sum of the net boosts of its matching terms
  > divided by the sum of the net boosts of all query terms.
  > 
  > This definition is a refinement of the original proposal below, and
  > it could probably be further refined if some impact of the tf, idf
  > and/or lengthNorm was desired in determining the top score.  These
  > other factors seem to be harder to normalize for, although I've
  > thought of some simple approaches; e.g., assume the unmatched terms
  > in the top result have values for these three factors that are the
  > average of the values of the matched terms, then apply exactly the
  > same concept of actual score divided by theoretical maximum score.
  > That would eliminate any need to maintain maximum value statistics in
  > the index.
  > 
  > As an example of the simple boost-based normalization, for the query
  >   ((a^2 b)^3 (c d^2))
  > the net boosts are:
  >   a --> 6
  >   b --> 3
  >   c --> 1
  >   d --> 2
  > 
  > So if a and b matched, but not c and d, in the top scoring result,
  > its score would be (6+3)/(6+3+1+2) = 0.75.  The normalizer would be
  > 0.75/(current score except for the current normalization).  This
  > normalizer would be applied to all current scores (minus
  > normalization) to create the normalized scores.
  > 
  > For the simple query (a b), if only one of the terms matched in the
  > top result, then its score would be 0.5, vs. 1.0 or many other
  > possible scores today.
  > 
  > In addition to enabling more "intelligent" UI's that communicate the
  > quality of results to end-users, the proposal below also extends the
  > explain() mechanism to fully explain the final normalized score.
  > However, that change is also independent -- it could be done with the
  > current scoring.
  > 
  > Am I the only one who would like to see better normalization in
  > Lucene?  Does anybody have a better approach?
  > 
  > If you've read this far, thanks for indulging me on this.  I would
  > love to see this or something with similar properties in Lucene.  I
  > really just want to build my app, but as stated below, I would write
  > and contribute this if there is interest in putting it in and nobody
  > else wants to write it.  Please let me know what you think one way or
  > the other.
  > 
  > Thanks,
  > 
  > Chuck
  >  
  > 
  >   > -----Original Message-----
  >   > From: Chuck Williams
  >   > Sent: Monday, October 18, 2004 7:04 PM
  >   > To: 'Lucene Developers List'
  >   > Subject: RE: idf and explain(), was Re: Search and Scoring
  >   > 
  >   > Doug Cutting wrote:
  >   >   > If this is a big issue for you, as it seems it is, please
  >   >   > submit a patch to optionally disable score normalization in
  >   >   > Hits.java.
  >   > and:
  >   >   > The quantity 'sum(t) weight(t,d)^2' must be recomputed for
  >   >   > each document each time a document is added to the collection,
  >   >   > since 'weight(t,d)' is dependent on global term statistics.
  >   >   > This is prohibitively expensive.  Research has also
  >   >   > demonstrated that such cosine normalization gives somewhat
  >   >   > inferior results (e.g., Singhal's pivoted length
  >   >   > normalization).
  >   > 
  >   > I'm willing to write, test and contribute code to address the
  >   > normalization issue, i.e. to yield scores in [0, 1] that are
  >   > meaningful across searches.  Unfortunately, this is considerably
  >   > more involved than just optionally eliminating the current
  >   > normalization in Hits.  Before undertaking this, I'd like to see
  >   > if there is agreement in principle that this is a good idea, and
  >   > that my specific proposal below is the right way to go.  Also, I'd
  >   > like to make sure I've correctly inferred the constraints on
  >   > writing code to be incorporated into Lucene.
  >   > 
  >   > After looking at this in more detail I agree that the cosine
  >   > normalization is not the way to go, because of both efficiency and
  >   > effectiveness considerations.  A conceptual approach that would be
  >   > efficient, relatively easy to implement, and seems to have at
  >   > least reasonable behavior would be to define the top scoring match
  >   > to have a score roughly equal to the percentage of query terms it
  >   > matches (its "netCoord").  Scores below the top hit would be
  >   > reduced based on the ratio of their raw scores to the raw score of
  >   > the top hit (considering all of the current score factors, except
  >   > that I'd like to remove one of the idf factors, as discussed
  >   > earlier).
  >   > 
  >   > For a couple of simple cases:
  >   >   a) the top match for a single term query would always have a
  >   >      score of 1.0,
  >   >   b) the top scoring match for a BooleanQuery using
  >   >      DefaultSimilarity with all non-prohibited TermQuery clauses
  >   >      would have a score of m/n, where the hit matches m of the n
  >   >      terms.
  >   > 
  >   > This isn't optimal, but seems much better than the current
  >   > situation.  Consider two single-term queries, s and t.  If s
  >   > matches more strongly than t in its top hit (e.g., a higher tf in
  >   > a shorter field), it would be best if the top score of s was
  >   > greater than the top score of t.  But this kind of normalization
  >   > would require keeping additional statistics that so far as I know
  >   > are not currently in the index, like the maximum tf for terms and
  >   > the minimum length for fields.  These could be expensive to update
  >   > on deletes.  Also, normalizing by such factors could yield lower
  >   > than subjectively reasonable scores in most cases, so it's not
  >   > clear it would be better.
  >   > 
  >   > The semantics above are at least clean, easy to understand, and
  >   > support what seems to me the most important motivation for doing
  >   > this: allowing an application to use simple thresholding to
  >   > segregate likely-to-be-relevant hits from likely-to-be-irrelevant
  >   > hits.
  >   > 
  >   > More specifically, for a BooleanQuery of TermQuery's the
  >   > resulting score functions would be:
  >   > 
  >   > BooleanQuery of TermQuery's sbq = (tq1 ... tqn)
  >   > 
  >   > baseScore(sbq, doc) =
  >   >   sum(tqi) boost(tqi)*idf(tqi.term)*tf(tqi.term, doc)*
  >   >            lengthNorm(tqi.term.field, doc)
  >   > 
  >   > rawScore(sbq, doc) = coord(sbq, doc) * baseScore
  >   > 
  >   > norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
  >   > 
  >   > score(sbq, doc) = rawScore * norm
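The pipeline those formulas describe can be sketched with plain arrays standing in for real scorers; the class and method names are illustrative, not Lucene API:

```java
// Illustrative sketch of the proposed normalization: given each hit's
// baseScore and coord, rawScore = coord * baseScore, norm = 1/max(baseScore),
// and the final score is rawScore * norm.
public class ProposedNorm {
    public static double[] normalize(double[] baseScores, double[] coords) {
        double maxBase = 0.0;
        for (double b : baseScores) maxBase = Math.max(maxBase, b);    // norm = 1/maxBase
        double[] scores = new double[baseScores.length];
        for (int i = 0; i < baseScores.length; i++) {
            double raw = coords[i] * baseScores[i];                    // rawScore(sbq, doc)
            scores[i] = (maxBase == 0.0) ? 0.0 : raw / maxBase;        // rawScore * norm
        }
        return scores;
    }

    public static void main(String[] args) {
        // Two hits: the best baseScore is 2.0, so every score is bounded by its coord.
        double[] scores = normalize(new double[] {2.0, 1.0}, new double[] {1.0, 0.5});
        System.out.println(scores[0] + " " + scores[1]);   // 1.0 0.25
    }
}
```

Note the invariant claimed below: since baseScore/maxBase is at most 1, no final score can exceed its coord.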
  >   > 
  >   > rawScore's can be computed in the Scorer.score() methods and
  >   > therefore used to sort the hits (e.g., in the instance method for
  >   > collect() in the HitCollector in IndexSearcher.search()).  If the
  >   > top scoring hit does not have the highest baseScore, then its
  >   > score could be less than its coord; this seems desirable.  These
  >   > formulas imply that no result score can be larger than its coord,
  >   > so if coord is well-defined (always between 0 and 1) then all
  >   > results will be normalized between 0 and 1.
  >   > 
  >   > In general, the netCoord, which takes the place of coord in the
  >   > simple case above, needs to be defined for any query.
  >   > Conceptually, this should be the total percentage of query terms
  >   > matched by the document.  It must be recursively computable from
  >   > the subquery structure and overridable for specific Query types
  >   > (e.g., to support specialized coords, like one that is always 1.0,
  >   > as is useful in multi-field searching).  Suitable default
  >   > definitions for TermQuery and BooleanQuery are:
  >   > 
  >   > TermQuery.netCoord = 1.0 if term matches, 0.0 otherwise
  >   > 
  >   > BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
  >   > 
  >   > This is not quite the percentage of terms matched; e.g., consider
  >   > a BooleanQuery with two clauses, one of which is a BooleanQuery of
  >   > 3 terms and the other of which is just a term.  However, it
  >   > doesn't seem unreasonable for a BooleanQuery to state that its
  >   > clauses are equally important, and this is consistent with the
  >   > current coord behavior.  BooleanQuery.netCoord could be overridden
  >   > for special cases, like the pure disjunction I use in my app for
  >   > field expansions:
  >   > 
  >   > MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord
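The recursive netCoord definitions can be sketched as follows; these are hypothetical stand-in types, not actual Lucene classes, with coord(1, n) taken as 1/n per the DefaultSimilarity convention:

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Sketch of recursive netCoord: TermQuery contributes 1 if matched, else 0;
// BooleanQuery averages its clauses (coord(1, n) = 1/n per clause); the
// MaxDisjunctionQuery variant takes the best clause instead.
public interface NetCoord {
    double netCoord(Predicate<String> matches);

    record Term(String term) implements NetCoord {
        public double netCoord(Predicate<String> matches) {
            return matches.test(term) ? 1.0 : 0.0;
        }
    }

    record Bool(List<NetCoord> clauses) implements NetCoord {
        public double netCoord(Predicate<String> matches) {
            double sum = 0.0;
            for (NetCoord c : clauses) sum += c.netCoord(matches);
            return sum / clauses.size();   // coord(1, n) = 1/n
        }
    }

    record MaxDisjunction(List<NetCoord> clauses) implements NetCoord {
        public double netCoord(Predicate<String> matches) {
            double max = 0.0;
            for (NetCoord c : clauses) max = Math.max(max, c.netCoord(matches));
            return max;
        }
    }

    static void main(String[] args) {
        // The two-clause example from the text: ((a b c) d) with a and d
        // matched gives (1/3 + 1)/2 = 2/3, not 2 of 4 terms = 0.5.
        NetCoord q = new Bool(List.of(
            new Bool(List.of(new Term("a"), new Term("b"), new Term("c"))),
            new Term("d")));
        System.out.println(q.netCoord(Set.of("a", "d")::contains));
    }
}
```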
  >   > 
  >   > Implementing this would proceed along these lines:
  >   >   1.  For backwards compatibility, add some kind of newScoring
  >   >       boolean setting.
  >   >   2.  Update all of these places to behave as indicated if
  >   >       newScoring:
  >   >     a.  Change Query.weight() to not do any normalization (no
  >   >         call to sumOfSquaredWeights(), Similarity.queryNorm() or
  >   >         normalize()).
  >   >     b.  Update all Query.weight classes to set their value
  >   >         according to the terms in the score formula above that
  >   >         don't involve the document (e.g., boost*idf for
  >   >         TermQuery).
  >   >     c.  Add suitable netCoord definitions to all Scorer classes.
  >   >     d.  Update all Scorer.score() methods to compute the rawScore
  >   >         as above.
  >   >     e.  Add the maximum baseScore as a field kept on TopDocs,
  >   >         computed in the HitCollector's.
  >   >     f.  Change the normalization in Hits to always divide every
  >   >         raw score by the maximum baseScore.
  >   >     g.  Update all of the current explain() methods to be
  >   >         consistent with this scoring, and to either report the
  >   >         rawScore, or report the final score if the normalization
  >   >         factor is provided.
  >   >     h.  Add Hits.explain() (or better, extend Searcher so that it
  >   >         keeps the Hits for use in Searcher.explain()) to call the
  >   >         new explain variation with the normalization factor so
  >   >         that final scores are fully explained.
  >   > 
  >   > If this seems like a good idea, please let me know.  I'm sure
  >   > there are details I've missed that would come out during coding
  >   > and testing.  Also, the value of this is dependent on how
  >   > reasonable the final scores look, which is hard to tell for sure
  >   > until it is working.
  >   > 
  >   > The coding standards for Lucene seem reasonably clear from the
  >   > source code I've read.  I could use just simple Java, so there
  >   > shouldn't be any significant JVM dependencies.  The above should
  >   > be fully backward compatible due to the newScoring flag.
  >   > 
  >   > On another note, I had to remove the German analyzer in my
  >   > current 1.4.2 source configuration because GermanStemmer failed to
  >   > compile due to what are apparently Unicode character constants
  >   > that I've now got as illegal two-character character constants.
  >   > This is presumably an encoding problem somewhere that I could
  >   > track down.  It's not important, but if the answer is obvious to
  >   > any of you, I'd appreciate the quick tip.
  >   > 
  >   > Thanks,
  >   > 
  >   > Chuck
  >   > 
  >   >   > -----Original Message-----
  >   >   > From: Doug Cutting [mailto:cutting@apache.org]
  >   >   > Sent: Monday, October 18, 2004 9:44 AM
  >   >   > To: Lucene Developers List
  >   >   > Subject: Re: idf and explain(), was Re: Search and Scoring
  >   >   >
  >   >   > Chuck Williams wrote:
  >   >   > > That's a good point on how the standard vector space inner
  >   >   > > product similarity measure does imply that the idf is
  >   >   > > squared relative to the document tf.  Even having been aware
  >   >   > > of this formula for a long time, this particular implication
  >   >   > > never occurred to me.  Do you know if anybody has done
  >   >   > > precision/recall or other empirical relevancy measurements
  >   >   > > comparing this vs. a model that does not square idf?
  >   >   >
  >   >   > No, not that I know of.
  >   >   >
  >   >   > > Regarding normalization, the normalization in Hits does not
  >   >   > > have very nice properties.  Due to the > 1.0 threshold
  >   >   > > check, it loses information, and it arbitrarily defines the
  >   >   > > highest scoring result in any list that generates scores
  >   >   > > above 1.0 as a perfect match.  It would be nice if score
  >   >   > > values were meaningful independent of searches, e.g., if
  >   >   > > "0.6" meant the same quality of retrieval independent of
  >   >   > > what search was done.  This would allow, for example, sites
  >   >   > > to use a simple quality threshold to only show results that
  >   >   > > were "good enough".  At my last company (I was President and
  >   >   > > head of engineering for InQuira), we found this to be
  >   >   > > important to many customers.
  >   >   >
  >   >   > If this is a big issue for you, as it seems it is, please
  >   >   > submit a patch to optionally disable score normalization in
  >   >   > Hits.java.
  >   >   >
  >   >   > > The standard vector space similarity measure includes
  >   >   > > normalization by the product of the norms of the vectors,
  >   >   > > i.e.:
  >   >   > >
  >   >   > >   score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ) /
  >   >   > >                sqrt [ (sum(t) weight(t,q)^2) *
  >   >   > >                       (sum(t) weight(t,d)^2) ]
  >   >   > >
  >   >   > > This makes the score a cosine, which, since the values are
  >   >   > > all positive, forces it to be in [0, 1].  The sumOfSquares()
  >   >   > > normalization in Lucene does not fully implement this.  Is
  >   >   > > there a specific reason for that?
  >   >   >
  >   >   > The quantity 'sum(t) weight(t,d)^2' must be recomputed for
  >   >   > each document each time a document is added to the collection,
  >   >   > since 'weight(t,d)' is dependent on global term statistics.
  >   >   > This is prohibitively expensive.  Research has also
  >   >   > demonstrated that such cosine normalization gives somewhat
  >   >   > inferior results (e.g., Singhal's pivoted length
  >   >   > normalization).
  >   >   >
  >   >   > > Re. explain(), I don't see a downside to extending it to
  >   >   > > show the final normalization in Hits.  It could still show
  >   >   > > the raw score just prior to that normalization.
  >   >   >
  >   >   > In order to normalize scores to 1.0 one must know the maximum
  >   >   > score.  Explain only computes the score for a single document,
  >   >   > and the maximum score is not known.
  >   >   >
  >   >   > > Although I think it would be best to have a normalization
  >   >   > > that would render scores comparable across searches.
  >   >   >
  >   >   > Then please submit a patch.  Lucene doesn't change on its own.
  >   >   >
  >   >   > Doug
  >   >   >
  >   >   >
  >   >   >
  >   >   >
  > 
  > 
  > 
  > 


  ---------------------------------------------------------------------
  To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
  For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
