From: David Spencer
Date: Mon, 07 Feb 2005 15:10:08 -0800
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
Subject: Re: Study Group (WAS Re: Normalized Scoring)
Message-ID: <4207F550.1030104@tropo.com>
In-Reply-To: <200526101416.488718@Kelvin>

You might want to see a post I just made to the thread with this long subject: "single field code ready - Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?"

I've done an example page that compares the results of searching with different query parsers and Similarities.

Kelvin Tan wrote:

> Wouldn't it be great if we could form a study group of Lucene folks who want to take the "next step"? I feel uneasy posting non-Lucene-specific questions to dev or user, even if they're related to IR.
>
> It feels to me like there could be a couple of people like us, who didn't do a dissertation in IR but would like a more in-depth knowledge for practical purposes. Basically, the end result is that we are able to tune or extend Lucene by using the expert API (classes marked as Expert). Perhaps a possible outcome is a tuning tutorial for advanced users who already know how to use Lucene.
>
> What do you think?
>
> k
>
> On Sat, 5 Feb 2005 22:10:26 -0800 (PST), Otis Gospodnetic wrote:
>
>> Exactly. Luckily, since then I've learned a bit from lucene-dev discussions and side IR readings, so some of the topics are making more sense now.
>>
>> Otis
>>
>> --- Kelvin Tan wrote:
>>
>>> Hi Otis, I was re-reading this whole theoretical thread about idf, scoring, normalization, etc. from last Oct and couldn't help laughing out loud when I read your post, coz it summed up what I was thinking the whole time.
>>> I think it's really great to have people like Chuck and Paul (Elschot) around. I'm learning so much.
>>>
>>> k
>>>
>>> On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
>>>
>>>> Hi Chuck,
>>>>
>>>> The relative lack of responses is not because there is no interest, but probably because there are only a few people on lucene-dev who can follow/understand every detail of your proposal. I understand and hear you, but I have a hard time 'visualizing' some of the formulas in your proposal. What you are saying sounds right to me, but I don't have enough theoretical knowledge to go one way or the other.
>>>>
>>>> Otis
>>>>
>>>> --- Chuck Williams wrote:
>>>>
>>>>> Hi everybody,
>>>>>
>>>>> Although there doesn't seem to be much interest in this, I have one significant improvement to the proposal below and a couple of observations that clarify the situation.
>>>>>
>>>>> To illustrate the problem better normalization is intended to address: in my current application, for BooleanQuery's of two terms I frequently get a top score of 1.0 when only one of the terms is matched. 1.0 should indicate a "perfect match". I'd like to set my UI up to present the results differently depending on how good the different results are (e.g., showing a visual indication of result quality, dropping the really bad results entirely, and segregating the good results from other, only vaguely relevant results). To build this kind of "intelligence" into the UI, I certainly need to know whether my top result matched all query terms, or only half of them. I'd like to have the score tell me directly how good the matches are. The current normalization does not achieve this.
>>>>>
>>>>> The intent of a new normalization scheme is to preserve the current scoring in the following sense: the ratio of the nth result's score to the best result's score remains the same. Therefore, the only question is what normalization factor to apply to all scores. This reduces to a very specific question that determines the entire normalization -- what should the score of the best matching result be?
>>>>>
>>>>> The mechanism below has this property, i.e. it keeps the current score ratios, except that I removed one idf term for reasons covered earlier (this better reflects the current empirically best scoring algorithms). However, removing an idf is a completely separate issue. The improved normalization is independent of whether or not that change is made.
>>>>>
>>>>> For the central question of what the top score should be, upon reflection, I don't like the definition below. It defined the top score as (approximately) the percentage of query terms matched by the top scoring result. A better conceptual definition is to use a weighted average based on the boosts, i.e., propagate all boosts down to the underlying terms (or phrases). Specifically, the "net boost" of a term is the product of the direct boost of the term and all boosts applied to encompassing clauses. Then the score of the top result becomes the sum of the net boosts of its matching terms divided by the sum of the net boosts of all query terms.
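To make the net-boost idea concrete, here is a rough Java sketch over a simplified query tree. The node types and helper below are hypothetical stand-ins for illustration, not Lucene's actual Query API:

    import java.util.*;

    // Hypothetical, simplified query tree -- not Lucene's Query classes.
    abstract class QueryNode {
        float boost = 1.0f;
    }

    class TermNode extends QueryNode {
        final String term;
        TermNode(String term, float boost) { this.term = term; this.boost = boost; }
    }

    class ClauseNode extends QueryNode {
        final List<QueryNode> children = new ArrayList<QueryNode>();
    }

    class NetBoost {
        // A term's net boost is its own boost times the boosts of all
        // enclosing clauses, computed by walking down the tree.
        static void collect(QueryNode node, float enclosing, Map<String, Float> out) {
            if (node instanceof TermNode) {
                TermNode t = (TermNode) node;
                out.put(t.term, enclosing * t.boost);
            } else {
                ClauseNode c = (ClauseNode) node;
                for (QueryNode child : c.children) {
                    collect(child, enclosing * c.boost, out);
                }
            }
        }

        // Top score = net boosts of matched terms / net boosts of all terms.
        static float topScore(Map<String, Float> netBoosts, Set<String> matched) {
            float matchedSum = 0f, totalSum = 0f;
            for (Map.Entry<String, Float> e : netBoosts.entrySet()) {
                totalSum += e.getValue();
                if (matched.contains(e.getKey())) matchedSum += e.getValue();
            }
            return totalSum == 0f ? 0f : matchedSum / totalSum;
        }
    }

For the query ((a^2 b)^3 (c d^2)) this walk yields net boosts a --> 6, b --> 3, c --> 1, d --> 2, matching the worked example further down.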
>>>>> This definition is a refinement of the original proposal below, and it could probably be further refined if some impact of the tf, idf and/or lengthNorm were desired in determining the top score. These other factors seem to be harder to normalize for, although I've thought of some simple approaches; e.g., assume the unmatched terms in the top result have values for these three factors that are the average of the values of the matched terms, then apply exactly the same concept of actual score divided by theoretical maximum score. That would eliminate any need to maintain maximum-value statistics in the index.
>>>>>
>>>>> As an example of the simple boost-based normalization, for the query ((a^2 b)^3 (c d^2)) the net boosts are: a --> 6, b --> 3, c --> 1, d --> 2.
>>>>>
>>>>> So if a and b matched, but not c and d, in the top scoring result, its score would be 0.75. The normalizer would be 0.75 / (current score except for the current normalization). This normalizer would be applied to all current scores (minus normalization) to create the normalized scores.
>>>>>
>>>>> For the simple query (a b), if only one of the terms matched in the top result, then its score would be 0.5, vs. 1.0 or many other possible scores today.
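Spelling out the arithmetic: matching a and b but not c and d gives (6 + 3) / (6 + 3 + 1 + 2) = 0.75 for the top hit, and every other hit is rescaled by the same factor. A minimal sketch, where the raw scores are made-up values:

    // Raw scores are hypothetical; only their ratio to the top raw score
    // survives normalization.
    float topScoreTarget = (6f + 3f) / (6f + 3f + 1f + 2f); // = 0.75
    float[] rawScores = { 4.2f, 2.1f, 0.7f };
    float normalizer = topScoreTarget / rawScores[0];
    for (int i = 0; i < rawScores.length; i++) {
        rawScores[i] *= normalizer; // 0.75, 0.375, 0.125
    }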
>>>>> In addition to enabling more "intelligent" UIs that communicate the quality of results to end-users, the proposal below also extends the explain() mechanism to fully explain the final normalized score. However, that change is also independent -- it could be done with the current scoring.
>>>>>
>>>>> Am I the only one who would like to see better normalization in Lucene? Does anybody have a better approach?
>>>>>
>>>>> If you've read this far, thanks for indulging me on this. I would love to see this or something with similar properties in Lucene. I really just want to build my app, but as stated below I would write and contribute this if there is interest in putting it in, and nobody else wants to write it. Please let me know what you think, one way or the other.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Chuck
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Chuck Williams
>>>>>> Sent: Monday, October 18, 2004 7:04 PM
>>>>>> To: 'Lucene Developers List'
>>>>>> Subject: RE: idf and explain(), was Re: Search and Scoring
>>>>>>
>>>>>> Doug Cutting wrote:
>>>>>>
>>>>>>> If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
>>>>>>
>>>>>> and:
>>>>>>
>>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
>>>>>>
>>>>>> I'm willing to write, test and contribute code to address the normalization issue, i.e. to yield scores in [0, 1] that are meaningful across searches. Unfortunately, this is considerably more involved than just optionally eliminating the current normalization in Hits. Before undertaking this, I'd like to see if there is agreement in principle that this is a good idea, and that my specific proposal below is the right way to go. Also, I'd like to make sure I've correctly inferred the constraints on writing code to be incorporated into Lucene.
>>>>>>
>>>>>> After looking at this in more detail, I agree that the cosine normalization is not the way to go, because of both efficiency and effectiveness considerations. A conceptual approach that would be efficient, relatively easy to implement, and seems to have at least reasonable behavior would be to define the top scoring match to have a score roughly equal to the percentage of query terms it matches (its "netCoord"). Scores below the top hit would be reduced based on the ratio of their raw scores to the raw score of the top hit (considering all of the current score factors, except that I'd like to remove one of the idf factors, as discussed earlier).
>>>>>>
>>>>>> For a couple of simple cases:
>>>>>> a) the top match for a single-term query would always have a score of 1.0,
>>>>>> b) the top scoring match for a BooleanQuery using DefaultSimilarity with all non-prohibited TermQuery clauses would have a score of m/n, where the hit matches m of the n terms.
>>>>>>
>>>>>> This isn't optimal, but seems much better than the current situation. Consider two single-term queries, s and t. If s matches more strongly than t in its top hit (e.g., a higher tf in a shorter field), it would be best if the top score of s was greater than the top score of t. But this kind of normalization would require keeping additional statistics that, so far as I know, are not currently in the index, like the maximum tf for terms and the minimum length for fields. These could be expensive to update on deletes. Also, normalizing by such factors could yield lower than subjectively reasonable scores in most cases, so it's not clear it would be better.
>>>>>>
>>>>>> The semantics above are at least clean, easy to understand, and support what seems to me the most important motivation to do this: allowing an application to use simple thresholding to segregate likely-to-be-relevant hits from likely-to-be-irrelevant hits.
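That thresholding payoff is easy to picture in application code. A minimal sketch, where the 0.5 cutoff and the helper are illustrative assumptions, not recommendations:

    // With scores guaranteed in [0, 1] and comparable across searches,
    // a fixed cutoff separates strong matches from marginal ones.
    static final float RELEVANT_CUTOFF = 0.5f; // hypothetical tuning choice

    static void present(float[] scores, String[] titles) {
        for (int i = 0; i < scores.length; i++) {
            String bucket = scores[i] >= RELEVANT_CUTOFF ? "[good]" : "[weak]";
            System.out.println(bucket + " " + titles[i] + " (" + scores[i] + ")");
        }
    }

Under the current normalization no fixed cutoff carries this meaning, since a top score of 1.0 can come from matching only half the query.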
>>>>>> More specifically, for a BooleanQuery of TermQuery's the resulting score functions would be:
>>>>>>
>>>>>> BooleanQuery of TermQuery's: sbq = (tq1 ... tqn)
>>>>>>
>>>>>> baseScore(sbq, doc) = sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(tqi.term.field, doc)
>>>>>>
>>>>>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore(sbq, doc)
>>>>>>
>>>>>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
>>>>>>
>>>>>> score(sbq, doc) = rawScore(sbq, doc) * norm(sbq, hits)
>>>>>>
>>>>>> rawScores can be computed in the Scorer.score() methods and therefore used to sort the hits (e.g., in the instance method for collect() in the HitCollector in IndexSearcher.search()). If the top scoring hit does not have the highest baseScore, then its score could be less than its coord; this seems desirable. These formulas imply that no result's score can be larger than its coord, so if coord is well-defined (always between 0 and 1) then all results will be normalized between 0 and 1.
>>>>>>
>>>>>> In general, the netCoord, which takes the place of coord in the simple case above, needs to be defined for any query. Conceptually, this should be the total percentage of query terms matched by the document. It must be recursively computable from the subquery structure and overridable for specific Query types (e.g., to support specialized coords, like one that is always 1.0, as is useful in multi-field searching). Suitable default definitions for TermQuery and BooleanQuery are:
>>>>>>
>>>>>> TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise
>>>>>>
>>>>>> BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
>>>>>>
>>>>>> This is not quite the percentage of terms matched; e.g., consider a BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms and the other of which is just a term. However, it doesn't seem unreasonable for a BooleanQuery to state that its clauses are equally important, and this is consistent with the current coord behavior. BooleanQuery.netCoord could be overridden for special cases, like the pure disjunction I use in my app for field expansions:
>>>>>>
>>>>>> MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord
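The recursive netCoord definitions translate almost directly into code. A rough sketch with hypothetical stand-in types (not Lucene's Scorer classes), using the fact that DefaultSimilarity defines coord(overlap, maxOverlap) as overlap/maxOverlap, so coord(1, n) = 1/n:

    import java.util.*;

    // netCoord: the fraction of the query a document matched,
    // computed recursively over the clause structure.
    interface CoordNode {
        float netCoord(Set<String> matchedTerms);
    }

    class TermCoord implements CoordNode {
        final String term;
        TermCoord(String term) { this.term = term; }
        public float netCoord(Set<String> matched) {
            return matched.contains(term) ? 1.0f : 0.0f; // TermQuery.netCoord
        }
    }

    class BooleanCoord implements CoordNode {
        final CoordNode[] clauses;
        BooleanCoord(CoordNode... clauses) { this.clauses = clauses; }
        public float netCoord(Set<String> matched) {
            float sum = 0f;
            for (CoordNode c : clauses) sum += c.netCoord(matched);
            return sum / clauses.length; // sum(ci) coord(1, n) * ci.netCoord
        }
    }

    class MaxDisjunctionCoord implements CoordNode {
        final CoordNode[] clauses;
        MaxDisjunctionCoord(CoordNode... clauses) { this.clauses = clauses; }
        public float netCoord(Set<String> matched) {
            float max = 0f;
            for (CoordNode c : clauses) max = Math.max(max, c.netCoord(matched));
            return max; // pure disjunction: the best clause wins
        }
    }

For the two-clause caveat above (a 3-term BooleanQuery plus a lone term), a document matching only the lone term gets netCoord (0 + 1)/2 = 0.5 rather than 1/4 -- exactly the "not quite the percentage of terms matched" behavior.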
>>>>>> Implementing this would proceed along these lines:
>>>>>>
>>>>>> 1. For backwards compatibility, add some kind of newScoring boolean setting.
>>>>>> 2. Update all of these places to behave as indicated if newScoring:
>>>>>>    a. Change Query.weight() to not do any normalization (no call to sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
>>>>>>    b. Update all Query.weight classes to set their value according to the terms in the score formula above that don't involve the document (e.g., boost*idf for TermQuery).
>>>>>>    c. Add suitable netCoord definitions to all Scorer classes.
>>>>>>    d. Update all Scorer.score() methods to compute the rawScore as above.
>>>>>>    e. Add the maximum baseScore as a field kept on TopDocs, computed in the HitCollectors.
>>>>>>    f. Change the normalization in Hits to always divide every raw score by the maximum baseScore.
>>>>>>    g. Update all of the current explain() methods to be consistent with this scoring, and to either report the rawScore, or to report the final score if the normalization factor is provided.
>>>>>>    h. Add Hits.explain() (or better, extend Searcher so that it keeps the Hits for use in Searcher.explain()) to call the new explain variation with the normalization factor so that final scores are fully explained.
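Steps e and f carry the mechanics of the scheme. A rough sketch of the collector-side bookkeeping -- simplified and hypothetical, not a patch against the real HitCollector or TopDocs:

    import java.util.*;

    class NormalizingCollector {
        private float maxBaseScore = 0f;
        private final List<Integer> docs = new ArrayList<Integer>();
        private final List<Float> rawScores = new ArrayList<Float>();

        // Called once per matching document (step e: track the max baseScore).
        void collect(int doc, float rawScore, float baseScore) {
            maxBaseScore = Math.max(maxBaseScore, baseScore);
            docs.add(doc);
            rawScores.add(rawScore);
        }

        // Step f: divide every raw score by the maximum baseScore. Since
        // rawScore = netCoord * baseScore and netCoord <= 1, every final
        // score lands in [0, 1].
        float score(int i) {
            return maxBaseScore == 0f ? 0f : rawScores.get(i) / maxBaseScore;
        }
    }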
Do >>>>>>>> you >>>>>>>> >>>>> >>>>> know >>>>> if >>>>> >>>>>>>> anybody has done precision/recall or other relevancy >>>>>>>> >>>>> >>>>> empirical >>>>> >>>>>>>> measurements comparing this vs. a model that does not >>>>>>>> >>>>> >>>>> square >>>>> idf? >>>>> >>>>> >>>>>>> No, not that I know of. >>>>>>> >>>>>>> >>>>>>>> Regarding normalization, the normalization in Hits does >>>>>>>> not >>>>>>>> >>>>> >>>>> have >>>>> >>>>>> very >>>>>> >>>>>>>> nice properties. Due to the > 1.0 threshold check, it >>>>>>>> >>>>> >>>>> loses >>>>> >>>>>>>> information, and it arbitrarily defines the highest >>>>>>>> scoring >>>>>>>> >>>>> >>>>> result >>>>> >>>>>> in >>>>>> >>>>>>>> any list that generates scores above 1.0 as a perfect >>>>>>>> >>>>> >>>>> match. >>>>> It >>>>> >>>>>> would >>>>>> >>>>>>>> be nice if score values were meaningful independent of >>>>>>>> >>>>> >>>>> searches, >>>>> >>>>>> e.g., >>>>>> >>>>>>>> if "0.6" meant the same quality of retrieval >>>>>>>> independent of >>>>>>>> >>>>> >>>>> what >>>>> >>>>>>> search >>>>>>> >>>>>>>> was done. This would allow, for example, sites to use >>>>>>>> a a >>>>>>>> >>>>> >>>>> simple >>>>> >>>>>>>> quality threshold to only show results that were "good >>>>>>>> >>>>> >>>>> enough". >>>>> >>>>>> At my >>>>>> >>>>>>>> last company (I was President and head of engineering >>>>>>>> for >>>>>>>> >>>>> >>>>> InQuira), >>>>> >>>>>> we >>>>>> >>>>>>>> found this to be important to many customers. >>>>>>>> >>>>>>> >>>>>>> If this is a big issue for you, as it seems it is, please >>>>>>> >>>>> >>>>> submit >>>>> a >>>>> >>>>>> patch >>>>>> >>>>>>> to optionally disable score normalization in Hits.java. >>>>>>> >>>>>>> >>>>>>>> The standard vector space similarity measure includes >>>>>>>> >>>>>> >>>>>> normalization by >>>>>> >>>>>>>> the product of the norms of the vectors, i.e.: >>>>>>>> >>>>>>>> score(d,q) = sum over t of ( weight(t,q) * weight(t,d) >>>>>>>> ) >>>>>>>> >>>>> >>>>> / >>>>> >>>>>>>> sqrt [ (sum(t) weight(t,q)^2) * (sum(t) >>>>>>>> >>>>>>> >>>>>>> weight(t,d)^2) ] >>>>>>> >>>>>>> >>>>>>>> This makes the score a cosine, which since the values >>>>>>>> are >>>>>>>> >>>>> >>>>> all >>>>> >>>>>> positive, >>>>>> >>>>>>>> forces it to be in [0, 1]. The sumOfSquares() >>>>>>>> >>>>> >>>>> normalization >>>>> in >>>>> >>>>>> Lucene >>>>>> >>>>>>>> does not fully implement this. Is there a specific >>>>>>>> reason >>>>>>>> >>>>> >>>>> for >>>>> >>>>>> that? >>>>>> >>>>>> >>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for >>>>>>> >>>>> >>>>> each >>>>> >>>>>> document >>>>>> >>>>>>> each time a document is added to the collection, since >>>>>>> >>>>> >>>>> 'weight(t,d)' >>>>> >>>>>> is >>>>>> >>>>>>> dependent on global term statistics. This is prohibitivly >>>>>>> >>>>> >>>>> expensive. >>>>> >>>>>>> Research has also demonstrated that such cosine >>>>>>> normalization >>>>>>> >>>>> >>>>> gives >>>>> >>>>>>> somewhat inferior results (e.g., Singhal's pivoted length >>>>>>> >>>>>> >>>>>> normalization). >>>>>> >>>>>> >>>>>>>> Re. explain(), I don't see a downside to extending it >>>>>>>> show >>>>>>>> >>>>> >>>>> the >>>>> >>>>>> final >>>>>> >>>>>>>> normalization in Hits. It could still show the raw >>>>>>>> score >>>>>>>> >>>>> >>>>> just >>>>> >>>>>> prior >>>>>> >>>>>>> to >>>>>>> >>>>>>>> that normalization. >>>>>>>> >>>>>>> >>>>>>> In order to normalize scores to 1.0 one must know the >>>>>>> maximum >>>>>>> >>>>> >>>>> score. 
>>>>>>>> Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization.
>>>>>>>
>>>>>>> In order to normalize scores to 1.0 one must know the maximum score. Explain only computes the score for a single document, and the maximum score is not known.
>>>>>>>
>>>>>>>> Although I think it would be best to have a normalization that would render scores comparable across searches.
>>>>>>>
>>>>>>> Then please submit a patch. Lucene doesn't change on its own.
>>>>>>>
>>>>>>> Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org