lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joaquin Delgado" <joaq...@triplehop.com>
Subject RE: Study Group (WAS Re: Normalized Scoring)
Date Mon, 07 Feb 2005 17:48:02 GMT
A very solid (and free) online course on "intelligent information
retrieval" with focus on practical issues can be found on Prof. Mooney
(Univ. of Texas):
http://www.cs.utexas.edu/users/mooney/ir-course/


I've also copied two interesting papers (from my own private library --
if you are interested I've got much more to offer) for your reading:


For those looking for an answer to the unsolved mystery in IR:
http://www.triplehop.com/pdf/How_Many_Relevances_in_IR.pdf

For those interesting in extending IR systems with Statistical NLP and
Machine Learning
http://www.triplehop.com/pdf/Text_Representation_and_ML_General_Concepts
.pdf


BTW, if someone has had a look at http://www.find.com (meta-search,
indexing and concept clustering system) I'd be interested in their
opinion.


-- Joaquin




-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
Sent: Monday, February 07, 2005 12:15 PM
To: Lucene Developers List
Subject: Re: Study Group (WAS Re: Normalized Scoring)

I think I see what you are after.  I'm after the same knowledge. :)

The only things that I can recommend are books:
  Modern Information Retrieval
  Managing Gigabytes

And online resources like:
  http://finance.groups.yahoo.com/group/mg/ (note the weird host name)
  http://www.sims.berkeley.edu/~hearst/irbook/

There is a pile of stuff in Citeseer, but those papers never really dig
into the details and always require solid previous knowledge of the
field.  They are no replacement for a class or a textbook.

If you find a good forum for IR, please share.

Otis


--- Kelvin Tan <kelvin-lists@relevanz.com> wrote:

> Wouldn't it be great if we can form a study-group of Lucene folks who
> want to take the "next step"? I feel uneasy posting non-Lucene
> specific questions to dev or user even if its related to IR.
> 
> Feels to me like there could be a couple like us, who didn't do a
> dissertation in IR, but would like a more indepth knowledge for
> practical purposes. Basically, the end result is that we are able to
> tune or extend lucene by using the Expert api (classes marked as
> Expert). Perhaps a possible outcome is a tuning tutorial for advanced
> users who already know how to use Lucene.
> 
> What do you think?
> 
> k
> 
> On Sat, 5 Feb 2005 22:10:26 -0800 (PST), Otis Gospodnetic wrote:
> > Exactly.  Luckily, since then I've learned a bit from lucene-dev
> > discussions and side IR readings, so some of the topics are making
> > more sense now.
> >
> > Otis
> >
> > --- Kelvin Tan <kelvin-lists@relevanz.com> wrote:
> >
> >> Hi Otis, I was re-reading this whole theoretical thread about
> >> idf, scoring, normalization, etc from last Oct and couldn't help
> >> laughing out loud when I read your post, coz it summed up what I
> >> was thinking the whole time. I think its really great to have
> >> people like Chuck and Paul (Eshlot) around. I'm learning so much.
> >>
> >> k
> >>
> >> On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
> >>
> >>> Hi Chuck,
> >>>
> >>> The relative lack of responses is not because there is no
> >>> interest, but probably because there are only a few people on
> >>> lucene-dev who can follow/understand every detail of your
> >>> proposal.  I understand and hear you, but I have a hard time
> >>> 'visualizing' some of the formulas in your proposal.  What you
> >>> are saying sounds right to me, but I don't have enough
> >>> theoretical knowledge to go one way or the other.
> >>>
> >>> Otis
> >>>
> >>>
> >>> --- Chuck Williams <chuck@manawiz.com> wrote:
> >>>
> >>>> Hi everybody,
> >>>>
> >>>> Although there doesn't seem to be much interest in this I
> >>>> have one significant improvement to the below and a couple
> >>>> observations that clarify the situation.
> >>>>
> >>>> To illustrate the problem better normalization is intended to
> >>>> address,
> >>>> in my current application for BooleanQuery's of two terms, I
> >>>> frequently
> >>>> get a top score of 1.0 when only one of the terms is matched.
> >>>> 1.0 should indicate a "perfect match".  I'd like set my UI up
> >>>> to present the
> >>>> results differently depending on how good the different
> >>>> results are (e.g., showing a visual indication of result
> >>>> quality, dropping the really bad results entirely, and
> >>>> segregating the good results from other
> >>>> only vaguely relevant results).  To build this kind of
> >>>> "intelligence" into the UI, I certainly need to know whether
> >>>> my top result matched all
> >>>> query terms, or only half of them.  I'd like to have the
> >>>> score tell me
> >>>> directly how good the matches are.  The current normalization
> >>>> does not achieve this.
> >>>>
> >>>> The intent of a new normalization scheme is to preserve the
> >>>> current scoring in the following sense:  the ratio of the nth
> >>>> result's score to
> >>>> the best result's score remains the same.  Therefore, the
> >>>> only question
> >>>> is what normalization factor to apply to all scores.  This
> >>>> reduces to a
> >>>> very specific question that determines the entire
> >>>> normalization -- what should the score of the best matching
> >>>> result be?
> >>>>
> >>>> The mechanism below has this property, i.e. it keeps the
> >>>> current score
> >>>> ratios, except that I removed one idf term for reasons
> >>>> covered earlier
> >>>> (this better reflects the current empirically best scoring
> >>>> algorithms).
> >>>> However, removing an idf is a completely separate issue.  The
> >>>> improved
> >>>> normalization is independent of whether or not that change is
> >>>> made.
> >>>>
> >>>> For the central question of what the top score should be,
> >>>> upon reflection, I don't like the definition below.  It
> >>>> defined the top score
> >>>> as (approximately) the percentage of query terms matched by
> >>>> the top scoring result.  A better conceptual definition is to
> >>>> use a weighted average based on the boosts.  I.e., downward
> >>>> propagate all boosts to the
> >>>> underlying terms (or phrases).  Secifically, the "net boost"
> >>>> of a term
> >>>> is the product of the direct boost of the term and all boosts
> >>>> applied to
> >>>> encompassing clauses.  Then the score of the top result
> >>>> becomes the sum
> >>>> of the net boosts of its matching terms divided by the sum of
> >>>> the net boosts of all query terms.
> >>>>
> >>>> This definition is a refinement of the original proposal
> >>>> below, and it
> >>>> could probably be further refined if some impact of the tf,
> >>>> idf and/or
> >>>> lengthNorm was desired in determining the top score.  These
> >>>> other factors seems to be harder to normalize for, although
> >>>> I've thought of some simple approaches; e.g., assume the
> >>>> unmatched terms in the top result have values for these three
> >>>> factors that are the average of the
> >>>> values of the matched terms, then apply exactly the same
> >>>> concept of actual score divided by theorectical maximum
> >>>> score.  That would eliminate any need to maintain maximum
> >>>> value statistics in the index.
> >>>>
> >>>> As an example of the simple boost-based normalization, for
> >>>> the query ((a^2 b)^3 (c d^2)) the net boosts are: a --> 6 b --
> >>>> > 3 c --
> >>>>
> >>>>> 1 d --> 2
> >>>>>
> >>>> So if a and b matched, but not c and d, in the top scoring
> >>>> result, its
> >>>> score would be 0.75.  The normalizer would be 0.75/(current
> >>>> score except
> >>>> for the current normalization).  This normalizer would be
> >>>> applied to all
> >>>> current scores (minus normalization) to create the normalized
> >>>> scores.
> >>>>
> >>>> For simple query (a b), if only one of the terms matched in
> >>>> the top result, then its score would be 0.5, vs. 1.0 or many
> >>>> other possible scores today.
> >>>>
> >>>> In addition to enabling more "intelligent" UI's that
> >>>> communicate the quality of results to end-users, the proposal
> >>>> below also extends the explain() mechanism to fully explain
> >>>> the final normalized score. However, that change is also
> >>>> independent -- it could be done with the current scoring.
> >>>>
> >>>> Am I the only one who would like to see better normalization
> >>>> in Lucene? Does anybody have a better approach?
> >>>>
> >>>> If you've read this far, thanks for indulging me on this.  I
> >>>> would love
> >>>> to see this or something with similar properties in Lucene.
> >>>> I really just want to build my app, but as stated below would
> >>>> write and contribute this if there is interest in putting it
> >>>> in, and nobody else
> >>>> wants to write it.  Please let me know what you think one way
> >>>> or the other.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Chuck
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Chuck Williams
> >>>>> Sent: Monday, October 18, 2004 7:04 PM
> >>>>> To: 'Lucene Developers List'
> >>>>> Subject: RE: idf and explain(), was Re: Search and Scoring
> >>>>>
> >>>>> Doug Cutting wrote:
> >>>>>> If this is a big issue for you, as it seems it is, please
> >>>>>>
> >>>> submit
> >>>> a
> >>>>> patch
> >>>>>> to optionally disable score normalization in Hits.java.
> >>>>>>
> >>>>> and:
> >>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for
> >>>>>>
> >>>> each
> >>>>> document
> >>>>>> each time a document is added to the collection, since
> >>>>>>
> >>>> 'weight(t,d)'
> >>>>> is
> >>>>>> dependent on global term statistics.  This is prohibitivly
> >>>>>>
> >>>> expensive.
> >>>>>> Research has also demonstrated that such cosine
> >>>>>> normalization
> >>>>>>
> >>>> gives
> >>>>>> somewhat inferior results (e.g., Singhal's pivoted length
> >>>>>>
> >>>>> normalization).
> >>>>>
> >>>>> I'm willing to write, test and contribute code to address
> >>>>> the normalization issue, i.e. to yield scores in [0, 1]
> >>>>> that are
> >>>>>
> >>>> meaningful
> >>>>> across searches.  Unfortunately, this is considerably more
> >>>>>
> >>>> involved
> >>>> that
> >>>>> just optionally eliminating the current normalization in
> >>>>> Hits.
> >>>>>
> >>>> Before
> >>>>> undertaking this, I'd like to see if there is agreement in
> >>>>>
> >>>> principle
> >>>>> that this is a good idea, and that my specific proposal
> >>>>> below is
> >>>>>
> >>>> the
> >>>>> right way to go.  Also, I'd like to make sure I've correctly
> >>>>>
> >>>> inferred
> >>>>> the constraints on writing code to be incorporated into
> >>>>> Lucene.
> >>>>>
> >>>>> After looking at this in more detail I agree that the
> >>>>> cosine normalization is not the way to go, because of both
> >>>>> efficiency
> >>>>>
> >>>> and
> >>>>> effectiveness considerations.  A conceptual approach that
> >>>>> would
> >>>>>
> >>>> be
> >>>>> efficient, relatively easy to implement, and seems to have
> >>>>> at
> >>>>>
> >>>> least
> >>>>> reasonable behavior would be to define the top scoring
> >>>>> match to
> >>>>>
> >>>> have
> >>>> a
> >>>>> score roughly equal to the percentage of query terms it
> >>>>> matches
> >>>>>
> >>>> (its
> >>>>> "netCoord").  Scores below the top hit would be reduced
> >>>>> based on
> >>>>>
> >>>> the
> >>>>> ratio of their raw scores to the raw score of the top hit
> >>>>>
> >>>> (considering
> >>>>> all of the current score factors, except that I'd like to
> >>>>> remove
> >>>>>
> >>>> one
> >>>> of
> >>>>> the idf factors, as discussed earlier).
> >>>>>
> >>>>> For a couple simple cases:
> >>>>> a) the top match for a single term query would always have a
> >>>>>
> >>>> score
> >>>> of
> >>>>> 1.0,
> >>>>> b) the top scoring match for a BooleanQuery using
> >>>>>
> >>>> DefaultSimilarity
> >>>>> with all non-prohibited TermQuery clauses would have a
> >>>>> score of
> >>>>>
> >>>> m/n,
> >>>>> where the hit matches m of the n terms.
> >>>>>
> >>>>> This isn't optimal, but seems much better than the current
> >>>>>
> >>>> situation.
> >>>>> Consider two single-term queries, s and t.  If s matches
> >>>>> more
> >>>>>
> >>>> strongly
> >>>>> than t in its top hit (e.g., a higher tf in a shorter
> >>>>> field), it
> >>>>>
> >>>> would
> >>>>> be best if the top score of s was greater score than top
> >>>>> score of
> >>>>>
> >>>> t.
> >>>>> But this kind of normalization would require keeping
> >>>>> additional statistics that so far as I know are not
> >>>>> currently in the index,
> >>>>>
> >>>> like
> >>>>> the maximum tf for terms and the minimum length for fields.
> >>>>>
> >>>> These
> >>>> could
> >>>>> be expensive to update on deletes.  Also, normalizing by
> >>>>> such
> >>>>>
> >>>> factors
> >>>>> could yield lower than subjectively reasonable scores in
> >>>>> most
> >>>>>
> >>>> cases,
> >>>> so
> >>>>> it's not clear it would be better.
> >>>>>
> >>>>> The semantics above are at least clean, easy to understand,
> >>>>> and
> >>>>>
> >>>> support
> >>>>> what seems to me is the most important motivation to do
> >>>>> this:
> >>>>>
> >>>> allowing
> >>>>> an application to use simple thresholding to segregate
> >>>>>
> >>>> likely-to-be-
> >>>>> relevant hits from likely-to-be-irrelevant hits.
> >>>>>
> >>>>> More specifically, for a BooleanQuery of TermQuery's the
> >>>>>
> >>>> resulting
> >>>> score
> >>>>> functions would be:
> >>>>>
> >>>>> BooleanQuery of TermQuery's sbq = (tq1 ... tqn)
> >>>>>
> >>>>> baseScore(sbq, doc) =
> >>>>> sum(tqi) boost(tqi)*idf(tqi.term)*tf(tqi.term, doc)*
> >>>>> lengthNorm(tqi.term.field, doc)
> >>>>>
> >>>>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore
> >>>>>
> >>>>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
> >>>>>
> >>>>> score(sbq, doc) = rawScore * norm
> >>>>>
> >>>>> rawScore's can be computed in the Scorer.score() methods and
> >>>>>
> >>>> therefore
> >>>>> used to sort the hits (e.g., in the instance method for
> >>>>> collect()
> >>>>>
> >>>> in
> >>>> the
> >>>>> HitCollector in IndexSearcher.search()).  If the top
> >>>>> scoring hit
> >>>>>
> >>>> does
> >>>>> not have the highest baseScore, then its score could be
> >>>>> less that
> >>>>>
> >>>> its
> >>>>> coord; this seems desirable.  These formulas imply that no
> >>>>> result
> >>>>>
> >>>> score
> >>>>> can be larger than its coord, so if coord is well-defined
> >>>>> (always between 0 and 1) then all results will be
> >>>>> normalized between 0
> >>>>>
> >>>> and
> >>>> 1.
> >>>>
> >>>>> In general, the netCoord, which takes the place of coord in
> >>>>> the
> >>>>>
> >>>> simple
> >>>>> case above, needs to be defined for any query.
> >>>>> Conceptually,
> >>>>>
> >>>> this
> >>>>> should be the total percentage of query terms matched by the
> >>>>>
> >>>> document.
> >>>>> It must be recursively computable from the subquery
> >>>>> structure and overridable for specific Query types (e.g.,
> >>>>> to support
> >>>>>
> >>>> specialized
> >>>>> coords, like one that is always 1.0 as is useful in multi-
> >>>>> field searching).  Suitable default definitions for
> >>>>> TermQuery and
> >>>>>
> >>>> BooleanQuery
> >>>>> are:
> >>>>>
> >>>>> TermQuery.netCoord = 1.0 if term matches, 0.0 otherwise
> >>>>>
> >>>>> BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) *
> >>>>>
> >>>> ci.netCoord
> >>>>
> >>>>> This is not quite percentage of terms matched; e.g.,
> >>>>> consider a BooleanQuery with two clauses, one of which is a
> >>>>> BooleanQuery of
> >>>>>
> >>>> 3
> >>>> terms
> >>>>> and the other which is just a term.  However, it doesn't
> >>>>> seem to
> >>>>>
> >>>> be
> >>>>> unreasonable for a BooleanQuery to state that its clauses
> >>>>> are
> >>>>>
> >>>> equally
> >>>>> important, and this is consistent with the current coord
> >>>>>
> >>>> behavior.
> >>>>> BooleanQuery.netCoord could be overridden for special
> >>>>> cases, like
> >>>>>
> >>>> the
> >>>>> pure disjunction I use in my app for field expansions:
> >>>>> MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord
> >>>>>
> >>>>> Implementing this would proceed along these lines: 1.  For
> >>>>> backwards compatibility, add some kind of newScoring
> >>>>>
> >>>> boolean
> >>>>> setting.
> >>>>> 2.  Update all of these places to behave as indicated if
> >>>>>
> >>>> newScoring:
> >>>>> a.  Change Query.weight() to not do any normalization (no
> >>>>>
> >>>> call
> >>>> to
> >>>>> sumOfSquaredWeights(), Similarity.queryNorm() or
> >>>>> normalize()). b.  Update all Query.weight classes to set
> >>>>> their value
> >>>>>
> >>>> according
> >>>> to
> >>>>> the terms in the score formula above that don't involve the
> >>>>>
> >>>> document
> >>>>> (e.g., boost*idf for TermQuery).
> >>>>> c.  Add suitable netCoord definitions to all Scorer
> >>>>> classes. d. Update all Scorer.score() methods to compute
> >>>>> the rawScore
> >>>>>
> >>>> as
> >>>>> above.
> >>>>> e.  Add the maximum baseScore as a field kept on TopDocs,
> >>>>>
> >>>> computed
> >>>>> in the HitCollector's.
> >>>>> f.  Change the normalization in Hits to always divide every
> >>>>>
> >>>> raw
> >>>>> score by the maximum baseScore.
> >>>>> g.  Update all of the current explain() methods to be
> >>>>>
> >>>> consistent
> >>>>> with this scoring, and to either report the rawScore, or to
> >>>>>
> >>>> report
> >>>> the
> >>>>> final score if the normalization factor is provided. h.
> >>>>> Add Hits.explain() (or better extend Searcher so that it
> >>>>>
> >>>> keeps
> >>>>> the Hits for use in Searcher.explain()) to call the new
> >>>>> explain variation with the normalization factor so that
> >>>>> final scores are
> >>>>>
> >>>> fully
> >>>>> explained.
> >>>>>
> >>>>> If this seems like a good idea, please let me know.  I'm
> >>>>> sure
> >>>>>
> >>>> there
> >>>> are
> >>>>> details I've missed that would come out during coding and
> >>>>>
> >>>> testing.
> >>>> Also,
> >>>>> the value of this is dependent on how reasonable the final
> >>>>> scores
> >>>>>
> >>>> look,
> >>>>> which is hard to tell for sure until it is working.
> >>>>>
> >>>>> The coding standards for Lucene seem reasonably clear from
> >>>>> the
> >>>>>
> >>>> source
> >>>>> code I've read.  I could use just simple Java so there
> >>>>> shouldn't
> >>>>>
> >>>> be
> >>>> any
> >>>>> significant JVM dependencies.  The above should be fully
> >>>>> backward compatible due to the newScoring flag.
> >>>>>
> >>>>> On another note, I had to remove the German analyzer in my
> >>>>>
> >>>> current
> >>>> 1.4.2
> >>>>> source configuration because GermanStemmer failed to
> >>>>> compile due
> >>>>>
> >>>> to
> >>>> what
> >>>>> are apparently Unicode character constants that I've now
> >>>>> got as
> >>>>>
> >>>> illegal
> >>>>> two-character character constants.  This is presumably an
> >>>>>
> >>>> encoding
> >>>>> problem somewhere that I could track down.  It's not
> >>>>> important,
> >>>>>
> >>>> but
> >>>> if
> >>>>> the answer is obvious to any of you, I'd appreciate the
> >>>>> quick
> >>>>>
> >>>> tip.
> >>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Chuck
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Doug Cutting [mailto:cutting@apache.org] Sent:
> >>>>>> Monday, October 18, 2004 9:44 AM To: Lucene Developers
> >>>>>> List Subject: Re: idf and explain(), was Re: Search and
> >>>>>> Scoring
> >>>>>>
> >>>>>> Chuck Williams wrote:
> >>>>>>> That's a good point on how the standard vector space
> >>>>>>> inner
> >>>>>>>
> >>>> product
> >>>>>>> similarity measure does imply that the idf is squared
> >>>>>>>
> >>>> relative
> >>>> to
> >>>>> the
> >>>>>>> document tf.  Even having been aware of this formula
> >>>>>>> for a
> >>>>>>>
> >>>> long
> >>>>> time,
> >>>>>>> this particular implication never occurred to me.  Do
> >>>>>>> you
> >>>>>>>
> >>>> know
> >>>> if
> >>>>>>> anybody has done precision/recall or other relevancy
> >>>>>>>
> >>>> empirical
> >>>>>>> measurements comparing this vs. a model that does not
> >>>>>>>
> >>>> square
> >>>> idf?
> >>>>
> >>>>>> No, not that I know of.
> >>>>>>
> >>>>>>> Regarding normalization, the normalization in Hits does
> >>>>>>> not
> >>>>>>>
> >>>> have
> >>>>> very
> >>>>>>> nice properties.  Due to the > 1.0 threshold check, it
> >>>>>>>
> >>>> loses
> >>>>>>> information, and it arbitrarily defines the highest
> >>>>>>> scoring
> >>>>>>>
> >>>> result
> >>>>> in
> >>>>>>> any list that generates scores above 1.0 as a perfect
> >>>>>>>
> >>>> match.
> >>>> It
> >>>>> would
> >>>>>>> be nice if score values were meaningful independent of
> >>>>>>>
> >>>> searches,
> >>>>> e.g.,
> >>>>>>> if "0.6" meant the same quality of retrieval
> >>>>>>> independent of
> >>>>>>>
> >>>> what
> >>>>>> search
> >>>>>>> was done.  This would allow, for example, sites to use
> >>>>>>> a a
> >>>>>>>
> >>>> simple
> >>>>>>> quality threshold to only show results that were "good
> >>>>>>>
> >>>> enough".
> >>>>> At my
> >>>>>>> last company (I was President and head of engineering
> >>>>>>> for
> >>>>>>>
> >>>> InQuira),
> >>>>> we
> >>>>>>> found this to be important to many customers.
> >>>>>>>
> >>>>>> If this is a big issue for you, as it seems it is, please
> >>>>>>
> >>>> submit
> >>>> a
> >>>>> patch
> >>>>>> to optionally disable score normalization in Hits.java.
> >>>>>>
> >>>>>>> The standard vector space similarity measure includes
> >>>>>>>
> >>>>> normalization by
> >>>>>>> the product of the norms of the vectors, i.e.:
> >>>>>>>
> >>>>>>> score(d,q) =  sum over t of ( weight(t,q) * weight(t,d)
> >>>>>>> )
> >>>>>>>
> >>>> /
> >>>>>>> sqrt [ (sum(t) weight(t,q)^2) * (sum(t)
> >>>>>>>
> >>>>>> weight(t,d)^2) ]
> >>>>>>
> >>>>>>> This makes the score a cosine, which since the values
> >>>>>>> are
> >>>>>>>
> >>>> all
> >>>>> positive,
> >>>>>>> forces it to be in [0, 1].  The sumOfSquares()
> >>>>>>>
> >>>> normalization
> >>>> in
> >>>>> Lucene
> >>>>>>> does not fully implement this.  Is there a specific
> >>>>>>> reason
> >>>>>>>
> >>>> for
> >>>>> that?
> >>>>>
> >>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for
> >>>>>>
> >>>> each
> >>>>> document
> >>>>>> each time a document is added to the collection, since
> >>>>>>
> >>>> 'weight(t,d)'
> >>>>> is
> >>>>>> dependent on global term statistics.  This is prohibitivly
> >>>>>>
> >>>> expensive.
> >>>>>> Research has also demonstrated that such cosine
> >>>>>> normalization
> >>>>>>
> >>>> gives
> >>>>>> somewhat inferior results (e.g., Singhal's pivoted length
> >>>>>>
> >>>>> normalization).
> >>>>>
> >>>>>>> Re. explain(), I don't see a downside to extending it
> >>>>>>> show
> >>>>>>>
> >>>> the
> >>>>> final
> >>>>>>> normalization in Hits.  It could still show the raw
> >>>>>>> score
> >>>>>>>
> >>>> just
> >>>>> prior
> >>>>>> to
> >>>>>>> that normalization.
> >>>>>>>
> >>>>>> In order to normalize scores to 1.0 one must know the
> >>>>>> maximum
> >>>>>>
> >>>> score.
> >>>>>> Explain only computes the score for a single document, and
> >>>>>>
> >>>> the
> >>>>> maximum
> >>>>>> score is not known.
> >>>>>>
> >>>>>>> Although I think it would be best to have a
> >>>>>>> normalization that would render scores comparable across
> >>>>>>>
> >>>> searches.
> >>>>
> >>>>>> Then please submit a patch.  Lucene doesn't change on its
> >>>>>>
> >>>> own.
> >>>>
> >>>>>> Doug
> >>>>>>
> >>>>>>
> >>>> --------------------------------------------------------------
> >>>> ---- --
> >>>>
> >>>>> -
> >>>>>> To unsubscribe, e-mail:
> >>>> lucene-dev-unsubscribe@jakarta.apache.org
> >>>>>> For additional commands, e-mail:
> >>>> lucene-dev-help@jakarta.apache.org
> >>>>
> >>>>
> >>>> --------------------------------------------------------------
> >>>> ---- --- To unsubscribe, e-mail: lucene-dev-
> >>>> unsubscribe@jakarta.apache.org For additional commands, e-
> >>>> mail: lucene-dev-help@jakarta.apache.org
> >>>
> >>>
> >
--------------------------------------------------------------------
> >
> >>> - To unsubscribe, e-mail: lucene-dev-
> >>> unsubscribe@jakarta.apache.org For additional commands, e-mail:
> >>> lucene-dev-help@jakarta.apache.org
> >>
> >>
> >> ------------------------------------------------------------------
> >> --- To unsubscribe, e-mail: lucene-dev-
> >> unsubscribe@jakarta.apache.org For additional commands, e-mail:
> >> lucene-dev-help@jakarta.apache.org
> >
> >
> >
--------------------------------------------------------------------
> > - To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Mime
View raw message