lucene-dev mailing list archives

From Kelvin Tan <kelvin-li...@relevanz.com>
Subject Study Group (WAS Re: Normalized Scoring)
Date Sun, 06 Feb 2005 09:14:16 GMT
Wouldn't it be great if we could form a study group of Lucene folks who want to take the "next
step"? I feel uneasy posting non-Lucene-specific questions to dev or user even if it's related
to IR.

Feels to me like there could be a couple of people like us, who didn't do a dissertation in IR, but
would like more in-depth knowledge for practical purposes. Basically, the end result is that
we are able to tune or extend Lucene by using the Expert API (classes marked as Expert). Perhaps
a possible outcome is a tuning tutorial for advanced users who already know how to use Lucene.

What do you think?

k

On Sat, 5 Feb 2005 22:10:26 -0800 (PST), Otis Gospodnetic wrote:
> Exactly.  Luckily, since then I've learned a bit from lucene-dev
> discussions and side IR readings, so some of the topics are making
> more sense now.
>
> Otis
>
> --- Kelvin Tan <kelvin-lists@relevanz.com> wrote:
>
>> Hi Otis, I was re-reading this whole theoretical thread about
>> idf, scoring, normalization, etc. from last Oct and couldn't help
>> laughing out loud when I read your post, coz it summed up what I
>> was thinking the whole time. I think it's really great to have
>> people like Chuck and Paul (Elschot) around. I'm learning so much.
>>
>> k
>>
>> On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
>>
>>> Hi Chuck,
>>>
>>> The relative lack of responses is not because there is no
>>> interest, but probably because there are only a few people on
>>> lucene-dev who can follow/understand every detail of your
>>> proposal.  I understand and hear you, but I have a hard time
>>> 'visualizing' some of the formulas in your proposal.  What you
>>> are saying sounds right to me, but I don't have enough
>>> theoretical knowledge to go one way or the other.
>>>
>>> Otis
>>>
>>>
>>> --- Chuck Williams <chuck@manawiz.com> wrote:
>>>
>>>> Hi everybody,
>>>>
>>>> Although there doesn't seem to be much interest in this, I
>>>> have one significant improvement to the below and a couple of
>>>> observations that clarify the situation.
>>>>
>>>> To illustrate the problem better normalization is intended to
>>>> address, in my current application for BooleanQuery's of two
>>>> terms, I frequently get a top score of 1.0 when only one of
>>>> the terms is matched.  1.0 should indicate a "perfect match".
>>>> I'd like to set my UI up to present the results differently
>>>> depending on how good the different results are (e.g., showing
>>>> a visual indication of result quality, dropping the really bad
>>>> results entirely, and segregating the good results from other
>>>> only vaguely relevant results).  To build this kind of
>>>> "intelligence" into the UI, I certainly need to know whether
>>>> my top result matched all query terms, or only half of them.
>>>> I'd like to have the score tell me directly how good the
>>>> matches are.  The current normalization does not achieve this.
>>>>
>>>> The intent of a new normalization scheme is to preserve the
>>>> current scoring in the following sense: the ratio of the nth
>>>> result's score to the best result's score remains the same.
>>>> Therefore, the only question is what normalization factor to
>>>> apply to all scores.  This reduces to a very specific question
>>>> that determines the entire normalization -- what should the
>>>> score of the best matching result be?
>>>>
>>>> The mechanism below has this property, i.e. it keeps the
>>>> current score ratios, except that I removed one idf term for
>>>> reasons covered earlier (this better reflects the current
>>>> empirically best scoring algorithms).  However, removing an
>>>> idf is a completely separate issue.  The improved
>>>> normalization is independent of whether or not that change is
>>>> made.
>>>>
>>>> For the central question of what the top score should be,
>>>> upon reflection, I don't like the definition below.  It
>>>> defined the top score as (approximately) the percentage of
>>>> query terms matched by the top scoring result.  A better
>>>> conceptual definition is to use a weighted average based on
>>>> the boosts, i.e., downward propagate all boosts to the
>>>> underlying terms (or phrases).  Specifically, the "net boost"
>>>> of a term is the product of the direct boost of the term and
>>>> all boosts applied to encompassing clauses.  Then the score of
>>>> the top result becomes the sum of the net boosts of its
>>>> matching terms divided by the sum of the net boosts of all
>>>> query terms.
>>>>
>>>> This definition is a refinement of the original proposal
>>>> below, and it could probably be further refined if some impact
>>>> of the tf, idf and/or lengthNorm was desired in determining
>>>> the top score.  These other factors seem to be harder to
>>>> normalize for, although I've thought of some simple
>>>> approaches; e.g., assume the unmatched terms in the top result
>>>> have values for these three factors that are the average of
>>>> the values of the matched terms, then apply exactly the same
>>>> concept of actual score divided by theoretical maximum score.
>>>> That would eliminate any need to maintain maximum value
>>>> statistics in the index.
>>>>
>>>> As an example of the simple boost-based normalization, for
>>>> the query ((a^2 b)^3 (c d^2)) the net boosts are:
>>>> a --> 6, b --> 3, c --> 1, d --> 2
>>>>
>>>> So if a and b matched, but not c and d, in the top scoring
>>>> result, its score would be 0.75.  The normalizer would be
>>>> 0.75/(current score except for the current normalization).
>>>> This normalizer would be applied to all current scores (minus
>>>> normalization) to create the normalized scores.
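The net-boost arithmetic above can be sketched in plain Java (the class and method names are illustrative, not Lucene API; boosts for the example query are hard-coded):

```java
import java.util.Map;
import java.util.Set;

public class NetBoostSketch {
    // Net boosts for the example query ((a^2 b)^3 (c d^2)):
    // each term's direct boost times the boosts of all enclosing clauses.
    static final Map<String, Double> NET_BOOSTS =
        Map.of("a", 2.0 * 3.0,   // 6
               "b", 1.0 * 3.0,   // 3
               "c", 1.0,         // 1
               "d", 2.0);        // 2

    // Top score = sum of net boosts of matched terms divided by
    // the sum of net boosts of all query terms.
    static double topScore(Set<String> matched) {
        double matchedSum = 0.0, totalSum = 0.0;
        for (Map.Entry<String, Double> e : NET_BOOSTS.entrySet()) {
            totalSum += e.getValue();
            if (matched.contains(e.getKey())) matchedSum += e.getValue();
        }
        return matchedSum / totalSum;
    }

    public static void main(String[] args) {
        // a and b matched, c and d not: (6 + 3) / (6 + 3 + 1 + 2)
        System.out.println(topScore(Set.of("a", "b")));  // 0.75
    }
}
```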
>>>>
>>>> For simple query (a b), if only one of the terms matched in
>>>> the top result, then its score would be 0.5, vs. 1.0 or many
>>>> other possible scores today.
>>>>
>>>> In addition to enabling more "intelligent" UI's that
>>>> communicate the quality of results to end-users, the proposal
>>>> below also extends the explain() mechanism to fully explain
>>>> the final normalized score. However, that change is also
>>>> independent -- it could be done with the current scoring.
>>>>
>>>> Am I the only one who would like to see better normalization
>>>> in Lucene? Does anybody have a better approach?
>>>>
>>>> If you've read this far, thanks for indulging me on this.  I
>>>> would love
>>>> to see this or something with similar properties in Lucene.
>>>> I really just want to build my app, but as stated below would
>>>> write and contribute this if there is interest in putting it
>>>> in, and nobody else
>>>> wants to write it.  Please let me know what you think one way
>>>> or the other.
>>>>
>>>> Thanks,
>>>>
>>>> Chuck
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Chuck Williams
>>>>> Sent: Monday, October 18, 2004 7:04 PM
>>>>> To: 'Lucene Developers List'
>>>>> Subject: RE: idf and explain(), was Re: Search and Scoring
>>>>>
>>>>> Doug Cutting wrote:
>>>>>> If this is a big issue for you, as it seems it is, please
>>>>>> submit a patch to optionally disable score normalization in
>>>>>> Hits.java.
>>>>>
>>>>> and:
>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for
>>>>>> each document each time a document is added to the
>>>>>> collection, since 'weight(t,d)' is dependent on global term
>>>>>> statistics.  This is prohibitively expensive.  Research has
>>>>>> also demonstrated that such cosine normalization gives
>>>>>> somewhat inferior results (e.g., Singhal's pivoted length
>>>>>> normalization).
>>>>>
>>>>> I'm willing to write, test and contribute code to address the
>>>>> normalization issue, i.e. to yield scores in [0, 1] that are
>>>>> meaningful across searches.  Unfortunately, this is
>>>>> considerably more involved than just optionally eliminating
>>>>> the current normalization in Hits.  Before undertaking this,
>>>>> I'd like to see if there is agreement in principle that this
>>>>> is a good idea, and that my specific proposal below is the
>>>>> right way to go.  Also, I'd like to make sure I've correctly
>>>>> inferred the constraints on writing code to be incorporated
>>>>> into Lucene.
>>>>>
>>>>> After looking at this in more detail I agree that the cosine
>>>>> normalization is not the way to go, because of both
>>>>> efficiency and effectiveness considerations.  A conceptual
>>>>> approach that would be efficient, relatively easy to
>>>>> implement, and seems to have at least reasonable behavior
>>>>> would be to define the top scoring match to have a score
>>>>> roughly equal to the percentage of query terms it matches
>>>>> (its "netCoord").  Scores below the top hit would be reduced
>>>>> based on the ratio of their raw scores to the raw score of
>>>>> the top hit (considering all of the current score factors,
>>>>> except that I'd like to remove one of the idf factors, as
>>>>> discussed earlier).
>>>>>
>>>>> For a couple simple cases:
>>>>> a) the top match for a single term query would always have a
>>>>>    score of 1.0,
>>>>> b) the top scoring match for a BooleanQuery using
>>>>>    DefaultSimilarity with all non-prohibited TermQuery
>>>>>    clauses would have a score of m/n, where the hit matches m
>>>>>    of the n terms.
>>>>>
>>>>> This isn't optimal, but seems much better than the current
>>>>> situation.
>>>>> Consider two single-term queries, s and t.  If s matches
>>>>> more strongly than t in its top hit (e.g., a higher tf in a
>>>>> shorter field), it would be best if the top score of s was
>>>>> greater than the top score of t.  But this kind of
>>>>> normalization would require keeping additional statistics
>>>>> that so far as I know are not currently in the index, like
>>>>> the maximum tf for terms and the minimum length for fields.
>>>>> These could be expensive to update on deletes.  Also,
>>>>> normalizing by such factors could yield lower than
>>>>> subjectively reasonable scores in most cases, so it's not
>>>>> clear it would be better.
>>>>>
>>>>> The semantics above are at least clean, easy to understand,
>>>>> and support what seems to me is the most important motivation
>>>>> to do this: allowing an application to use simple
>>>>> thresholding to segregate likely-to-be-relevant hits from
>>>>> likely-to-be-irrelevant hits.
>>>>>
>>>>> More specifically, for a BooleanQuery of TermQuery's the
>>>>> resulting score functions would be:
>>>>>
>>>>> BooleanQuery of TermQuery's sbq = (tq1 ... tqn)
>>>>>
>>>>> baseScore(sbq, doc) =
>>>>> sum(tqi) boost(tqi)*idf(tqi.term)*tf(tqi.term, doc)*
>>>>> lengthNorm(tqi.term.field, doc)
>>>>>
>>>>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore
>>>>>
>>>>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
>>>>>
>>>>> score(sbq, doc) = rawScore * norm
>>>>>
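The score functions above can be sketched as a toy model in plain Java (hypothetical inputs standing in for real index statistics; this is not the actual Scorer implementation):

```java
import java.util.List;

public class NewScoringSketch {
    // One matching term's contribution to baseScore:
    // boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(field, doc).
    record TermMatch(double boost, double idf, double tf, double lengthNorm) {}

    // baseScore(sbq, doc) = sum over matching terms of their contributions.
    static double baseScore(List<TermMatch> matches) {
        return matches.stream()
            .mapToDouble(m -> m.boost() * m.idf() * m.tf() * m.lengthNorm())
            .sum();
    }

    // rawScore(sbq, doc) = coord(sbq, doc) * baseScore.
    static double rawScore(double coord, List<TermMatch> matches) {
        return coord * baseScore(matches);
    }

    // score(sbq, doc) = rawScore * norm, where
    // norm(sbq, hits) = 1 / (max baseScore over all hits).
    static double score(double rawScore, double maxBaseScore) {
        return rawScore / maxBaseScore;
    }
}
```

Note how, exactly as argued below, no final score can exceed its coord: rawScore is at most coord * maxBaseScore, so score is at most coord.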
>>>>> rawScore's can be computed in the Scorer.score() methods and
>>>>> therefore used to sort the hits (e.g., in the instance method
>>>>> for collect() in the HitCollector in
>>>>> IndexSearcher.search()).  If the top scoring hit does not
>>>>> have the highest baseScore, then its score could be less than
>>>>> its coord; this seems desirable.  These formulas imply that
>>>>> no result score can be larger than its coord, so if coord is
>>>>> well-defined (always between 0 and 1) then all results will
>>>>> be normalized between 0 and 1.
>>>>>
>>>>> In general, the netCoord, which takes the place of coord in
>>>>> the simple case above, needs to be defined for any query.
>>>>> Conceptually, this should be the total percentage of query
>>>>> terms matched by the document.  It must be recursively
>>>>> computable from the subquery structure and overridable for
>>>>> specific Query types (e.g., to support specialized coords,
>>>>> like one that is always 1.0 as is useful in multi-field
>>>>> searching).  Suitable default definitions for TermQuery and
>>>>> BooleanQuery are:
>>>>>
>>>>> TermQuery.netCoord = 1.0 if term matches, 0.0 otherwise
>>>>>
>>>>> BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) *
>>>>> ci.netCoord
>>>>>
>>>>> This is not quite the percentage of terms matched; e.g.,
>>>>> consider a BooleanQuery with two clauses, one of which is a
>>>>> BooleanQuery of 3 terms and the other which is just a term.
>>>>> However, it doesn't seem to be unreasonable for a
>>>>> BooleanQuery to state that its clauses are equally important,
>>>>> and this is consistent with the current coord behavior.
>>>>> BooleanQuery.netCoord could be overridden for special cases,
>>>>> like the pure disjunction I use in my app for field
>>>>> expansions:
>>>>>
>>>>> MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord
>>>>>
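The netCoord recursion above can be sketched with a tiny hypothetical query tree (plain Java classes, not Lucene's Query API; coord(1, n) is assumed to be the default 1/n):

```java
import java.util.List;
import java.util.function.Predicate;

// A minimal query tree supporting the recursive netCoord definitions.
abstract class QueryNode {
    abstract double netCoord(Predicate<String> docHasTerm);
}

class Term extends QueryNode {
    final String term;
    Term(String term) { this.term = term; }
    // TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise.
    double netCoord(Predicate<String> docHasTerm) {
        return docHasTerm.test(term) ? 1.0 : 0.0;
    }
}

class Bool extends QueryNode {
    final List<QueryNode> clauses;
    Bool(QueryNode... clauses) { this.clauses = List.of(clauses); }
    // BooleanQuery(c1..cn).netCoord = sum(ci) coord(1, n) * ci.netCoord,
    // assuming coord(1, n) = 1/n.
    double netCoord(Predicate<String> docHasTerm) {
        double sum = 0.0;
        for (QueryNode c : clauses) {
            sum += c.netCoord(docHasTerm) / clauses.size();
        }
        return sum;
    }
}

class MaxDisjunction extends QueryNode {
    final List<QueryNode> clauses;
    MaxDisjunction(QueryNode... clauses) { this.clauses = List.of(clauses); }
    // MaxDisjunctionQuery(c1..cn).netCoord = max(ci) ci.netCoord.
    double netCoord(Predicate<String> docHasTerm) {
        return clauses.stream()
            .mapToDouble(c -> c.netCoord(docHasTerm))
            .max().orElse(0.0);
    }
}
```

This also illustrates the "not quite percentage of terms matched" point: for a two-clause BooleanQuery whose first clause is a 3-term BooleanQuery and whose second is a lone term, a document matching one inner term plus the lone term gets netCoord (1/3)/2 + 1/2 = 2/3, not 2/4.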
>>>>> Implementing this would proceed along these lines:
>>>>> 1.  For backwards compatibility, add some kind of newScoring
>>>>>     boolean setting.
>>>>> 2.  Update all of these places to behave as indicated if
>>>>>     newScoring:
>>>>>     a.  Change Query.weight() to not do any normalization (no
>>>>>         call to sumOfSquaredWeights(), Similarity.queryNorm()
>>>>>         or normalize()).
>>>>>     b.  Update all Query.weight classes to set their value
>>>>>         according to the terms in the score formula above
>>>>>         that don't involve the document (e.g., boost*idf for
>>>>>         TermQuery).
>>>>>     c.  Add suitable netCoord definitions to all Scorer
>>>>>         classes.
>>>>>     d.  Update all Scorer.score() methods to compute the
>>>>>         rawScore as above.
>>>>>     e.  Add the maximum baseScore as a field kept on TopDocs,
>>>>>         computed in the HitCollector's.
>>>>>     f.  Change the normalization in Hits to always divide
>>>>>         every raw score by the maximum baseScore.
>>>>>     g.  Update all of the current explain() methods to be
>>>>>         consistent with this scoring, and to either report
>>>>>         the rawScore, or to report the final score if the
>>>>>         normalization factor is provided.
>>>>>     h.  Add Hits.explain() (or better extend Searcher so that
>>>>>         it keeps the Hits for use in Searcher.explain()) to
>>>>>         call the new explain variation with the normalization
>>>>>         factor so that final scores are fully explained.
>>>>>
>>>>> If this seems like a good idea, please let me know.  I'm
>>>>> sure there are details I've missed that would come out during
>>>>> coding and testing.  Also, the value of this is dependent on
>>>>> how reasonable the final scores look, which is hard to tell
>>>>> for sure until it is working.
>>>>>
>>>>> The coding standards for Lucene seem reasonably clear from
>>>>> the source code I've read.  I could use just simple Java so
>>>>> there shouldn't be any significant JVM dependencies.  The
>>>>> above should be fully backward compatible due to the
>>>>> newScoring flag.
>>>>>
>>>>> On another note, I had to remove the German analyzer in my
>>>>> current 1.4.2 source configuration because GermanStemmer
>>>>> failed to compile due to what are apparently Unicode
>>>>> character constants that I've now got as illegal
>>>>> two-character character constants.  This is presumably an
>>>>> encoding problem somewhere that I could track down.  It's not
>>>>> important, but if the answer is obvious to any of you, I'd
>>>>> appreciate the quick tip.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Chuck
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Doug Cutting [mailto:cutting@apache.org] Sent:
>>>>>> Monday, October 18, 2004 9:44 AM To: Lucene Developers
>>>>>> List Subject: Re: idf and explain(), was Re: Search and
>>>>>> Scoring
>>>>>>
>>>>>> Chuck Williams wrote:
>>>>>>> That's a good point on how the standard vector space inner
>>>>>>> product similarity measure does imply that the idf is
>>>>>>> squared relative to the document tf.  Even having been
>>>>>>> aware of this formula for a long time, this particular
>>>>>>> implication never occurred to me.  Do you know if anybody
>>>>>>> has done precision/recall or other relevancy empirical
>>>>>>> measurements comparing this vs. a model that does not
>>>>>>> square idf?
>>>>>>
>>>>>> No, not that I know of.
>>>>>>
>>>>>>> Regarding normalization, the normalization in Hits does
>>>>>>> not have very nice properties.  Due to the > 1.0 threshold
>>>>>>> check, it loses information, and it arbitrarily defines
>>>>>>> the highest scoring result in any list that generates
>>>>>>> scores above 1.0 as a perfect match.  It would be nice if
>>>>>>> score values were meaningful independent of searches,
>>>>>>> e.g., if "0.6" meant the same quality of retrieval
>>>>>>> independent of what search was done.  This would allow,
>>>>>>> for example, sites to use a simple quality threshold to
>>>>>>> only show results that were "good enough".  At my last
>>>>>>> company (I was President and head of engineering for
>>>>>>> InQuira), we found this to be important to many customers.
>>>>>>>
>>>>>> If this is a big issue for you, as it seems it is, please
>>>>>> submit a patch to optionally disable score normalization in
>>>>>> Hits.java.
>>>>>>
>>>>>>> The standard vector space similarity measure includes
>>>>>>> normalization by the product of the norms of the vectors,
>>>>>>> i.e.:
>>>>>>>
>>>>>>> score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ) /
>>>>>>>     sqrt[ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
>>>>>>>
>>>>>>> This makes the score a cosine, which since the values are
>>>>>>> all positive, forces it to be in [0, 1].  The
>>>>>>> sumOfSquares() normalization in Lucene does not fully
>>>>>>> implement this.  Is there a specific reason for that?
>>>>>
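The cosine measure quoted above can be sketched as a generic term-vector computation (plain Java; this is an illustration of the formula, not Lucene's internals):

```java
import java.util.Map;

public class CosineSketch {
    // score(d,q) = sum(t) w(t,q)*w(t,d)
    //            / sqrt( sum(t) w(t,q)^2 * sum(t) w(t,d)^2 )
    // Query and document are sparse term->weight vectors.
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0.0, qNormSq = 0.0, dNormSq = 0.0;
        for (double w : q.values()) qNormSq += w * w;
        for (double w : d.values()) dNormSq += w * w;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            Double wd = d.get(e.getKey());
            if (wd != null) dot += e.getValue() * wd;  // shared terms only
        }
        if (qNormSq == 0.0 || dNormSq == 0.0) return 0.0;
        return dot / Math.sqrt(qNormSq * dNormSq);
    }
}
```

Since all weights are non-negative, the result lies in [0, 1], with 1.0 only when the vectors are parallel, which is the property under discussion.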
>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for
>>>>>> each document each time a document is added to the
>>>>>> collection, since 'weight(t,d)' is dependent on global term
>>>>>> statistics.  This is prohibitively expensive.  Research has
>>>>>> also demonstrated that such cosine normalization gives
>>>>>> somewhat inferior results (e.g., Singhal's pivoted length
>>>>>> normalization).
>>>>>>
>>>>>>> Re. explain(), I don't see a downside to extending it to
>>>>>>> show the final normalization in Hits.  It could still show
>>>>>>> the raw score just prior to that normalization.
>>>>>>>
>>>>>> In order to normalize scores to 1.0 one must know the
>>>>>> maximum score.  Explain only computes the score for a
>>>>>> single document, and the maximum score is not known.
>>>>>>
>>>>>>> Although I think it would be best to have a normalization
>>>>>>> that would render scores comparable across searches.
>>>>>>
>>>>>> Then please submit a patch.  Lucene doesn't change on its
>>>>>> own.
>>>>>>
>>>>>> Doug
>>>>>>
>>>>>>
>>>> --------------------------------------------------------------
>>>> ---- --
>>>>
>>>>> -
>>>>>> To unsubscribe, e-mail:
>>>> lucene-dev-unsubscribe@jakarta.apache.org
>>>>>> For additional commands, e-mail:
>>>> lucene-dev-help@jakarta.apache.org
>>>>
>>>>
>>>> --------------------------------------------------------------
>>>> ---- --- To unsubscribe, e-mail: lucene-dev-
>>>> unsubscribe@jakarta.apache.org For additional commands, e-
>>>> mail: lucene-dev-help@jakarta.apache.org
>>>
>>>
> --------------------------------------------------------------------
>
>>> - To unsubscribe, e-mail: lucene-dev-
>>> unsubscribe@jakarta.apache.org For additional commands, e-mail:
>>> lucene-dev-help@jakarta.apache.org
>>
>>
>> ------------------------------------------------------------------
>> --- To unsubscribe, e-mail: lucene-dev-
>> unsubscribe@jakarta.apache.org For additional commands, e-mail:
>> lucene-dev-help@jakarta.apache.org
>
>
> --------------------------------------------------------------------
> - To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

