lucene-dev mailing list archives

From Kelvin Tan <kelvin-li...@relevanz.com>
Subject RE: Study Group (WAS Re: Normalized Scoring)
Date Mon, 07 Feb 2005 18:16:19 GMT
People!!! thanks for sharing, but this is what the wiki is for!! Will everyone kindly add their
posts to http://wiki.apache.org/jakarta-lucene/InformationRetrieval for posterity?

k

On Mon, 7 Feb 2005 12:48:02 -0500, Joaquin Delgado wrote:
> A very solid (and free) online course on "intelligent information
> retrieval" with a focus on practical issues can be found at Prof.
> Mooney's (Univ. of Texas) site:
> http://www.cs.utexas.edu/users/mooney/ir-course/
>
>
> I've also copied two interesting papers (from my own private
> library -- if you are interested I've got much more to offer) for
> your reading:
>
>
> For those looking for an answer to the unsolved mystery in IR:
> http://www.triplehop.com/pdf/How_Many_Relevances_in_IR.pdf
>
> For those interested in extending IR systems with Statistical NLP
> and Machine Learning:
> http://www.triplehop.com/pdf/Text_Representation_and_ML_General_Concepts.pdf
>
>
> BTW, if someone has had a look at http://www.find.com (meta-search,
> indexing and concept clustering system) I'd be interested in their
> opinion.
>
>
> -- Joaquin
>
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Monday, February 07, 2005 12:15 PM
> To: Lucene Developers List
> Subject: Re: Study Group (WAS Re: Normalized Scoring)
>
> I think I see what you are after.  I'm after the same knowledge. :)
>
> The only things that I can recommend are books:
> Modern Information Retrieval
> Managing Gigabytes
>
> And online resources like:
> http://finance.groups.yahoo.com/group/mg/ (note the weird host name)
> http://www.sims.berkeley.edu/~hearst/irbook/
>
> There is a pile of stuff in Citeseer, but those papers never really
> dig into the details and always require solid previous knowledge of
> the field.  They are no replacement for a class or a textbook.
>
> If you find a good forum for IR, please share.
>
> Otis
>
>
> --- Kelvin Tan <kelvin-lists@relevanz.com> wrote:
>
>> Wouldn't it be great if we could form a study-group of Lucene
>> folks who want to take the "next step"? I feel uneasy posting
>> non-Lucene-specific questions to dev or user even if it's related
>> to IR.
>>
>> Feels to me like there could be a couple like us, who didn't do a
>> dissertation in IR, but would like a more in-depth knowledge for
>> practical purposes. Basically, the end result is that we are able
>> to tune or extend Lucene by using the Expert API (classes marked
>> as Expert). Perhaps a possible outcome is a tuning tutorial for
>> advanced users who already know how to use Lucene.
>>
>> What do you think?
>>
>> k
>>
>> On Sat, 5 Feb 2005 22:10:26 -0800 (PST), Otis Gospodnetic wrote:
>>
>>> Exactly.  Luckily, since then I've learned a bit from lucene-
>>> dev discussions and side IR readings, so some of the topics are
>>> making more sense now.
>>>
>>> Otis
>>>
>>> --- Kelvin Tan <kelvin-lists@relevanz.com> wrote:
>>>
>>>> Hi Otis, I was re-reading this whole theoretical thread about
>>>> idf, scoring, normalization, etc. from last Oct and couldn't
>>>> help laughing out loud when I read your post, coz it summed
>>>> up what I was thinking the whole time. I think it's really
>>>> great to have people like Chuck and Paul (Elschot) around. I'm
>>>> learning so much.
>>>>
>>>> k
>>>>
>>>> On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic
>>>> wrote:
>>>>
>>>>> Hi Chuck,
>>>>>
>>>>> The relative lack of responses is not because there is no
>>>>> interest, but probably because there are only a few people
>>>>> on lucene-dev who can follow/understand every detail of
>>>>> your proposal.  I understand and hear you, but I have a
>>>>> hard time 'visualizing' some of the formulas in your
>>>>> proposal.  What you are saying sounds right to me, but I
>>>>> don't have enough theoretical knowledge to go one way or
>>>>> the other.
>>>>>
>>>>> Otis
>>>>>
>>>>>
>>>>> --- Chuck Williams <chuck@manawiz.com> wrote:
>>>>>
>>>>>> Hi everybody,
>>>>>>
>>>>>> Although there doesn't seem to be much interest in this I
>>>>>> have one significant improvement to the below and a
>>>>>> couple observations that clarify the situation.
>>>>>>
>>>>>> To illustrate the problem better normalization is intended
>>>>>> to address: in my current application, for BooleanQuery's of
>>>>>> two terms, I frequently get a top score of 1.0 when only one
>>>>>> of the terms is matched.  1.0 should indicate a "perfect
>>>>>> match".  I'd like to set my UI up to present the results
>>>>>> differently depending on how good the different results are
>>>>>> (e.g., showing a visual indication of result quality,
>>>>>> dropping the really bad results entirely, and segregating
>>>>>> the good results from other only vaguely relevant results).
>>>>>> To build this kind of "intelligence" into the UI, I
>>>>>> certainly need to know whether my top result matched all
>>>>>> query terms, or only half of them.  I'd like to have the
>>>>>> score tell me directly how good the matches are.  The
>>>>>> current normalization does not achieve this.
>>>>>>
>>>>>> The intent of a new normalization scheme is to preserve
>>>>>> the current scoring in the following sense:  the ratio of
>>>>>> the nth result's score to
>>>>>> the best result's score remains the same.  Therefore, the
>>>>>> only question
>>>>>> is what normalization factor to apply to all scores.
>>>>>> This reduces to a
>>>>>> very specific question that determines the entire
>>>>>> normalization -- what should the score of the best
>>>>>> matching result be?
>>>>>>
>>>>>> The mechanism below has this property, i.e. it keeps the
>>>>>> current score
>>>>>> ratios, except that I removed one idf term for reasons
>>>>>> covered earlier
>>>>>> (this better reflects the current empirically best
>>>>>> scoring algorithms).
>>>>>> However, removing an idf is a completely separate issue.
>>>>>> The improved
>>>>>> normalization is independent of whether or not that
>>>>>> change is made.
>>>>>>
>>>>>> For the central question of what the top score should be,
>>>>>> upon reflection, I don't like the definition below.  It
>>>>>> defined the top score
>>>>>> as (approximately) the percentage of query terms matched
>>>>>> by the top scoring result.  A better conceptual
>>>>>> definition is to use a weighted average based on the
>>>>>> boosts.  I.e., downward propagate all boosts to the
>>>>>> underlying terms (or phrases).  Specifically, the "net
>>>>>> boost" of a term
>>>>>> is the product of the direct boost of the term and all
>>>>>> boosts applied to
>>>>>> encompassing clauses.  Then the score of the top result
>>>>>> becomes the sum
>>>>>> of the net boosts of its matching terms divided by the
>>>>>> sum of the net boosts of all query terms.
>>>>>>
>>>>>> This definition is a refinement of the original proposal
>>>>>> below, and it
>>>>>> could probably be further refined if some impact of the
>>>>>> tf, idf and/or
>>>>>> lengthNorm was desired in determining the top score.
>>>>>> These other factors seem to be harder to normalize for,
>>>>>> although I've thought of some simple approaches; e.g.,
>>>>>> assume the unmatched terms in the top result have values
>>>>>> for these three factors that are the average of the
>>>>>> values of the matched terms, then apply exactly the same
>>>>>> concept of actual score divided by theoretical maximum
>>>>>> score.  That would eliminate any need to maintain maximum
>>>>>> value statistics in the index.
>>>>>>
>>>>>> As an example of the simple boost-based normalization,
>>>>>> for the query ((a^2 b)^3 (c d^2)) the net boosts are:
>>>>>> a --> 6, b --> 3, c --> 1, d --> 2
>>>>>>
>>>>>> So if a and b matched, but not c and d, in the top
>>>>>> scoring result, its
>>>>>> score would be 0.75.  The normalizer would be
>>>>>> 0.75/(current score except
>>>>>> for the current normalization).  This normalizer would be
>>>>>> applied to all
>>>>>> current scores (minus normalization) to create the
>>>>>> normalized scores.
>>>>>>
>>>>>> For simple query (a b), if only one of the terms matched
>>>>>> in the top result, then its score would be 0.5, vs. 1.0
>>>>>> or many other possible scores today.
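The net-boost arithmetic above can be sketched in a few lines of plain Java (a hypothetical illustration with invented class and method names, not Lucene code):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the net-boost normalization described above;
// names are invented for illustration (not Lucene API).
public class NetBoostDemo {

    // Net boosts for the example query ((a^2 b)^3 (c d^2)): each term's
    // boost multiplied by the boosts of all enclosing clauses.
    static Map<String, Double> netBoosts() {
        Map<String, Double> nb = new LinkedHashMap<>();
        nb.put("a", 2.0 * 3.0); // a^2 inside the ^3 clause --> 6
        nb.put("b", 1.0 * 3.0); // b inside the ^3 clause   --> 3
        nb.put("c", 1.0);       // c, unboosted             --> 1
        nb.put("d", 2.0);       // d^2                      --> 2
        return nb;
    }

    // Top result's score: sum of net boosts of its matched terms divided
    // by the sum of net boosts of all query terms.
    static double topScore(Map<String, Double> netBoosts, Set<String> matched) {
        double matchedSum = 0.0, totalSum = 0.0;
        for (Map.Entry<String, Double> e : netBoosts.entrySet()) {
            totalSum += e.getValue();
            if (matched.contains(e.getKey())) matchedSum += e.getValue();
        }
        return matchedSum / totalSum;
    }

    public static void main(String[] args) {
        // a and b matched, c and d not: (6 + 3) / (6 + 3 + 1 + 2) = 0.75
        System.out.println(topScore(netBoosts(), Set.of("a", "b")));
    }
}
```

The same helper gives 0.5 for the simple query (a b) with one term matched, since both terms then carry a net boost of 1.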
>>>>>>
>>>>>> In addition to enabling more "intelligent" UI's that
>>>>>> communicate the quality of results to end-users, the
>>>>>> proposal below also extends the explain() mechanism to
>>>>>> fully explain the final normalized score. However, that
>>>>>> change is also independent -- it could be done with the
>>>>>> current scoring.
>>>>>>
>>>>>> Am I the only one who would like to see better
>>>>>> normalization in Lucene? Does anybody have a better
>>>>>> approach?
>>>>>>
>>>>>> If you've read this far, thanks for indulging me on this.
>>>>>>  I would love
>>>>>> to see this or something with similar properties in
>>>>>> Lucene. I really just want to build my app, but as stated
>>>>>> below would write and contribute this if there is
>>>>>> interest in putting it in, and nobody else
>>>>>> wants to write it.  Please let me know what you think one
>>>>>> way or the other.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Chuck
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Chuck Williams
>>>>>>> Sent: Monday, October 18, 2004 7:04 PM
>>>>>>> To: 'Lucene Developers List'
>>>>>>> Subject: RE: idf and explain(), was Re: Search and
>>>>>>> Scoring
>>>>>>>
>>>>>>> Doug Cutting wrote:
>>>>>>>> If this is a big issue for you, as it seems it is, please
>>>>>>>> submit a patch to optionally disable score normalization
>>>>>>>> in Hits.java.
>>>>>>>
>>>>>>> and:
>>>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed
>>>>>>>> for each document each time a document is added to the
>>>>>>>> collection, since 'weight(t,d)' is dependent on global
>>>>>>>> term statistics.  This is prohibitively expensive.
>>>>>>>> Research has also demonstrated that such cosine
>>>>>>>> normalization gives somewhat inferior results (e.g.,
>>>>>>>> Singhal's pivoted length normalization).
>>>>>>>
>>>>>>> I'm willing to write, test and contribute code to address
>>>>>>> the normalization issue, i.e. to yield scores in [0, 1]
>>>>>>> that are meaningful across searches.  Unfortunately, this
>>>>>>> is considerably more involved than just optionally
>>>>>>> eliminating the current normalization in Hits.  Before
>>>>>>> undertaking this, I'd like to see if there is agreement in
>>>>>>> principle that this is a good idea, and that my specific
>>>>>>> proposal below is the right way to go.  Also, I'd like to
>>>>>>> make sure I've correctly inferred the constraints on
>>>>>>> writing code to be incorporated into Lucene.
>>>>>>>
>>>>>>> After looking at this in more detail I agree that the
>>>>>>> cosine normalization is not the way to go, because of both
>>>>>>> efficiency and effectiveness considerations.  A conceptual
>>>>>>> approach that would be efficient, relatively easy to
>>>>>>> implement, and seems to have at least reasonable behavior
>>>>>>> would be to define the top scoring match to have a score
>>>>>>> roughly equal to the percentage of query terms it matches
>>>>>>> (its "netCoord").  Scores below the top hit would be
>>>>>>> reduced based on the ratio of their raw scores to the raw
>>>>>>> score of the top hit (considering all of the current score
>>>>>>> factors, except that I'd like to remove one of the idf
>>>>>>> factors, as discussed earlier).
>>>>>>>
>>>>>>> For a couple simple cases:
>>>>>>> a) the top match for a single term query would always have
>>>>>>> a score of 1.0,
>>>>>>> b) the top scoring match for a BooleanQuery using
>>>>>>> DefaultSimilarity with all non-prohibited TermQuery
>>>>>>> clauses would have a score of m/n, where the hit matches m
>>>>>>> of the n terms.
>>>>>>>
>>>>>>> This isn't optimal, but seems much better than the current
>>>>>>> situation.  Consider two single-term queries, s and t.  If
>>>>>>> s matches more strongly than t in its top hit (e.g., a
>>>>>>> higher tf in a shorter field), it would be best if the top
>>>>>>> score of s was greater than the top score of t.  But this
>>>>>>> kind of normalization would require keeping additional
>>>>>>> statistics that so far as I know are not currently in the
>>>>>>> index, like the maximum tf for terms and the minimum
>>>>>>> length for fields.  These could be expensive to update on
>>>>>>> deletes.  Also, normalizing by such factors could yield
>>>>>>> lower than subjectively reasonable scores in most cases,
>>>>>>> so it's not clear it would be better.
>>>>>>>
>>>>>>> The semantics above are at least clean, easy to
>>>>>>> understand, and support what seems to me the most
>>>>>>> important motivation to do this: allowing an application
>>>>>>> to use simple thresholding to segregate likely-to-be-
>>>>>>> relevant hits from likely-to-be-irrelevant hits.
>>>>>>>
>>>>>>> More specifically, for a BooleanQuery of TermQuery's the
>>>>>>> resulting score functions would be:
>>>>>>>
>>>>>>> BooleanQuery of TermQuery's sbq = (tq1 ... tqn)
>>>>>>>
>>>>>>> baseScore(sbq, doc) =
>>>>>>> sum(tqi) boost(tqi)*idf(tqi.term)*tf(tqi.term, doc)*
>>>>>>> lengthNorm(tqi.term.field, doc)
>>>>>>>
>>>>>>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore
>>>>>>>
>>>>>>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
>>>>>>>
>>>>>>> score(sbq, doc) = rawScore * norm
>>>>>>>
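The score functions above can be exercised with a small stand-alone sketch (hypothetical, non-Lucene code; the per-hit baseScore and coord values are supplied as plain arrays rather than computed from an index):

```java
// Hypothetical sketch of the proposed normalization pipeline; it assumes
// per-hit baseScore and coord values have already been computed.
public class ProposedNorm {

    // score_i = rawScore_i * norm, where rawScore_i = coord_i * baseScore_i
    // and norm = 1 / max_j baseScore_j.
    static double[] normalize(double[] baseScores, double[] coords) {
        double maxBase = 0.0;
        for (double b : baseScores) maxBase = Math.max(maxBase, b);
        double[] scores = new double[baseScores.length];
        for (int i = 0; i < baseScores.length; i++) {
            scores[i] = coords[i] * baseScores[i] / maxBase;
        }
        return scores;
    }

    public static void main(String[] args) {
        // Hit 0 matches both terms of a two-term query (coord 1.0) with
        // baseScore 3.0; hit 1 matches one term (coord 0.5) but has the
        // higher baseScore 4.0.
        double[] scores = normalize(new double[]{3.0, 4.0}, new double[]{1.0, 0.5});
        // Hit 0 tops the list with a score below its coord (0.75 < 1.0),
        // since it does not have the highest baseScore.
        System.out.println(scores[0] + " " + scores[1]); // 0.75 0.5
    }
}
```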
>>>>>>> rawScore's can be computed in the Scorer.score() methods
>>>>>>> and therefore used to sort the hits (e.g., in the instance
>>>>>>> method for collect() in the HitCollector in
>>>>>>> IndexSearcher.search()).  If the top scoring hit does not
>>>>>>> have the highest baseScore, then its score could be less
>>>>>>> than its coord; this seems desirable.  These formulas
>>>>>>> imply that no result score can be larger than its coord,
>>>>>>> so if coord is well-defined (always between 0 and 1) then
>>>>>>> all results will be normalized between 0 and 1.
>>>>>>
>>>>>>> In general, the netCoord, which takes the place of coord
>>>>>>> in the simple case above, needs to be defined for any
>>>>>>> query.  Conceptually, this should be the total percentage
>>>>>>> of query terms matched by the document.  It must be
>>>>>>> recursively computable from the subquery structure and
>>>>>>> overridable for specific Query types (e.g., to support
>>>>>>> specialized coords, like one that is always 1.0 as is
>>>>>>> useful in multi-field searching).  Suitable default
>>>>>>> definitions for TermQuery and BooleanQuery are:
>>>>>>>
>>>>>>> TermQuery.netCoord = 1.0 if term matches, 0.0 otherwise
>>>>>>>
>>>>>>> BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
>>>>>>>
>>>>>>> This is not quite the percentage of terms matched; e.g.,
>>>>>>> consider a BooleanQuery with two clauses, one of which is
>>>>>>> a BooleanQuery of 3 terms and the other of which is just a
>>>>>>> term.  However, it doesn't seem unreasonable for a
>>>>>>> BooleanQuery to state that its clauses are equally
>>>>>>> important, and this is consistent with the current coord
>>>>>>> behavior.  BooleanQuery.netCoord could be overridden for
>>>>>>> special cases, like the pure disjunction I use in my app
>>>>>>> for field expansions:
>>>>>>>
>>>>>>> MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord
>>>>>>>
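The recursive netCoord definitions above can be sketched as a tiny class hierarchy (a hypothetical illustration with invented types; these are not Lucene classes, and coord(1, n) is taken to be 1/n):

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the recursive netCoord definitions above.
public class NetCoordDemo {

    interface Q { double netCoord(Set<String> docTerms); }

    // TermQuery.netCoord = 1.0 if term matches, 0.0 otherwise
    record Term(String t) implements Q {
        public double netCoord(Set<String> d) { return d.contains(t) ? 1.0 : 0.0; }
    }

    // BooleanQuery(c1 ... cn).netCoord = (1/n) * sum(ci) ci.netCoord
    record Bool(List<Q> clauses) implements Q {
        public double netCoord(Set<String> d) {
            double sum = 0.0;
            for (Q c : clauses) sum += c.netCoord(d);
            return sum / clauses.size();
        }
    }

    // MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord
    record MaxDis(List<Q> clauses) implements Q {
        public double netCoord(Set<String> d) {
            double max = 0.0;
            for (Q c : clauses) max = Math.max(max, c.netCoord(d));
            return max;
        }
    }

    public static void main(String[] args) {
        // The example from the text: two clauses, one a BooleanQuery of
        // three terms, the other a single term.
        Q q = new Bool(List.of(
                new Bool(List.of(new Term("a"), new Term("b"), new Term("c"))),
                new Term("d")));
        // Matching a, b, d gives (2/3 + 1)/2 = 5/6, not the raw 3/4 of
        // terms matched -- "not quite percentage of terms matched".
        System.out.println(q.netCoord(Set.of("a", "b", "d")));
    }
}
```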
>>>>>>> Implementing this would proceed along these lines:
>>>>>>> 1.  For backwards compatibility, add some kind of
>>>>>>> newScoring boolean setting.
>>>>>>> 2.  Update all of these places to behave as indicated if
>>>>>>> newScoring:
>>>>>>>   a.  Change Query.weight() to not do any normalization
>>>>>>>   (no call to sumOfSquaredWeights(),
>>>>>>>   Similarity.queryNorm() or normalize()).
>>>>>>>   b.  Update all Query.weight classes to set their value
>>>>>>>   according to the terms in the score formula above that
>>>>>>>   don't involve the document (e.g., boost*idf for
>>>>>>>   TermQuery).
>>>>>>>   c.  Add suitable netCoord definitions to all Scorer
>>>>>>>   classes.
>>>>>>>   d.  Update all Scorer.score() methods to compute the
>>>>>>>   rawScore as above.
>>>>>>>   e.  Add the maximum baseScore as a field kept on
>>>>>>>   TopDocs, computed in the HitCollector's.
>>>>>>>   f.  Change the normalization in Hits to always divide
>>>>>>>   every raw score by the maximum baseScore.
>>>>>>>   g.  Update all of the current explain() methods to be
>>>>>>>   consistent with this scoring, and to either report the
>>>>>>>   rawScore, or to report the final score if the
>>>>>>>   normalization factor is provided.
>>>>>>>   h.  Add Hits.explain() (or better, extend Searcher so
>>>>>>>   that it keeps the Hits for use in Searcher.explain()) to
>>>>>>>   call the new explain variation with the normalization
>>>>>>>   factor so that final scores are fully explained.
>>>>>>>
>>>>>>> If this seems like a good idea, please let me know.  I'm
>>>>>>> sure there are details I've missed that would come out
>>>>>>> during coding and testing.  Also, the value of this is
>>>>>>> dependent on how reasonable the final scores look, which
>>>>>>> is hard to tell for sure until it is working.
>>>>>>>
>>>>>>> The coding standards for Lucene seem reasonably clear
>>>>>>> from the source code I've read.  I could use just simple
>>>>>>> Java, so there shouldn't be any significant JVM
>>>>>>> dependencies.  The above should be fully backward
>>>>>>> compatible due to the newScoring flag.
>>>>>>>
>>>>>>> On another note, I had to remove the German analyzer in
>>>>>>> my current 1.4.2 source configuration because
>>>>>>> GermanStemmer failed to compile due to what are apparently
>>>>>>> Unicode character constants that I've now got as illegal
>>>>>>> two-character character constants.  This is presumably an
>>>>>>> encoding problem somewhere that I could track down.  It's
>>>>>>> not important, but if the answer is obvious to any of you,
>>>>>>> I'd appreciate the quick tip.
>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Chuck
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Doug Cutting [mailto:cutting@apache.org]
>>>>>>>> Sent: Monday, October 18, 2004 9:44 AM
>>>>>>>> To: Lucene Developers List
>>>>>>>> Subject: Re: idf and explain(), was Re: Search and Scoring
>>>>>>>>
>>>>>>>> Chuck Williams wrote:
>>>>>>>>> That's a good point on how the standard vector space
>>>>>>>>> inner product similarity measure does imply that the idf
>>>>>>>>> is squared relative to the document tf.  Even having
>>>>>>>>> been aware of this formula for a long time, this
>>>>>>>>> particular implication never occurred to me.  Do you
>>>>>>>>> know if anybody has done precision/recall or other
>>>>>>>>> relevancy empirical measurements comparing this vs. a
>>>>>>>>> model that does not square idf?
>>>>>>>>
>>>>>>>> No, not that I know of.
>>>>>>>>
>>>>>>>>> Regarding normalization, the normalization in Hits does
>>>>>>>>> not have very nice properties.  Due to the > 1.0
>>>>>>>>> threshold check, it loses information, and it
>>>>>>>>> arbitrarily defines the highest scoring result in any
>>>>>>>>> list that generates scores above 1.0 as a perfect match.
>>>>>>>>> It would be nice if score values were meaningful
>>>>>>>>> independent of searches, e.g., if "0.6" meant the same
>>>>>>>>> quality of retrieval independent of what search was
>>>>>>>>> done.  This would allow, for example, sites to use a
>>>>>>>>> simple quality threshold to only show results that were
>>>>>>>>> "good enough".  At my last company (I was President and
>>>>>>>>> head of engineering for InQuira), we found this to be
>>>>>>>>> important to many customers.
>>>>>>>>>
>>>>>>>> If this is a big issue for you, as it seems it is,
>>>>>>>> please submit a patch to optionally disable score
>>>>>>>> normalization in Hits.java.
>>>>>>>>
>>>>>>>>> The standard vector space similarity measure includes
>>>>>>>>> normalization by the product of the norms of the
>>>>>>>>> vectors, i.e.:
>>>>>>>>>
>>>>>>>>> score(d,q) = sum over t of ( weight(t,q) * weight(t,d) )
>>>>>>>>>   / sqrt [ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
>>>>>>>>>
>>>>>>>>> This makes the score a cosine, which since the values
>>>>>>>>> are all positive, forces it to be in [0, 1].  The
>>>>>>>>> sumOfSquares() normalization in Lucene does not fully
>>>>>>>>> implement this.  Is there a specific reason for that?
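The quoted cosine measure is easy to state concretely; a minimal sketch with dense term-weight vectors aligned by index (illustrative only, not how Lucene stores weights):

```java
// Minimal sketch of the cosine similarity quoted above: the inner product
// of the two weight vectors divided by the product of their norms.
public class CosineDemo {

    static double cosine(double[] q, double[] d) {
        double dot = 0.0, qq = 0.0, dd = 0.0;
        for (int t = 0; t < q.length; t++) {
            dot += q[t] * d[t]; // sum over t of weight(t,q) * weight(t,d)
            qq += q[t] * q[t];  // sum(t) weight(t,q)^2
            dd += d[t] * d[t];  // sum(t) weight(t,d)^2
        }
        return dot / Math.sqrt(qq * dd);
    }

    public static void main(String[] args) {
        // With all-positive weights the score lands in [0, 1]; identical
        // vectors score exactly 1.0.
        System.out.println(cosine(new double[]{1.0, 2.0}, new double[]{1.0, 2.0}));
    }
}
```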
>>>>>>>
>>>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed
>>>>>>>> for each document each time a document is added to the
>>>>>>>> collection, since 'weight(t,d)' is dependent on global
>>>>>>>> term statistics.  This is prohibitively expensive.
>>>>>>>> Research has also demonstrated that such cosine
>>>>>>>> normalization gives somewhat inferior results (e.g.,
>>>>>>>> Singhal's pivoted length normalization).
>>>>>>>
>>>>>>>>> Re. explain(), I don't see a downside to extending it
>>>>>>>>> to show the final normalization in Hits.  It could still
>>>>>>>>> show the raw score just prior to that normalization.
>>>>>>>>>
>>>>>>>> In order to normalize scores to 1.0 one must know the
>>>>>>>> maximum score.  Explain only computes the score for a
>>>>>>>> single document, and the maximum score is not known.
>>>>>>>>
>>>>>>>>> Although I think it would be best to have a
>>>>>>>>> normalization that would render scores comparable
>>>>>>>>> across searches.
>>>>>>>>
>>>>>>>> Then please submit a patch.  Lucene doesn't change on
>>>>>>>> its own.
>>>>>>>>
>>>>>>>> Doug
>>>>>>>>
>>>>>>>>



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

