From: Kelvin Tan
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Date: Mon, 7 Feb 2005 19:16:19 +0100
Subject: RE: Study Group (WAS Re: Normalized Scoring)
People!!! Thanks for sharing, but this is what the wiki is for!! Will everyone kindly add their posts to http://wiki.apache.org/jakarta-lucene/InformationRetrieval for posterity?

k

On Mon, 7 Feb 2005 12:48:02 -0500, Joaquin Delgado wrote:
> A very solid (and free) online course on "intelligent information retrieval" with a focus on practical issues can be found on Prof. Mooney's (Univ. of Texas) site: http://www.cs.utexas.edu/users/mooney/ir-course/
>
> I've also copied two interesting papers (from my own private library -- if you are interested I've got much more to offer) for your reading:
>
> For those looking for an answer to the unsolved mystery in IR:
> http://www.triplehop.com/pdf/How_Many_Relevances_in_IR.pdf
>
> For those interested in extending IR systems with Statistical NLP and Machine Learning:
> http://www.triplehop.com/pdf/Text_Representation_and_ML_General_Concepts.pdf
>
> BTW, if someone has had a look at http://www.find.com (meta-search, indexing and concept clustering system) I'd be interested in their opinion.
>
> -- Joaquin
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Monday, February 07, 2005 12:15 PM
> To: Lucene Developers List
> Subject: Re: Study Group (WAS Re: Normalized Scoring)
>
> I think I see what you are after. I'm after the same knowledge. :)
>
> The only things that I can recommend are books:
> Modern Information Retrieval
> Managing Gigabytes
>
> And online resources like:
> http://finance.groups.yahoo.com/group/mg/ (note the weird host name)
> http://www.sims.berkeley.edu/~hearst/irbook/
>
> There is a pile of stuff in Citeseer, but those papers never really dig into the details and always require solid previous knowledge of the field. They are no replacement for a class or a textbook.
> If you find a good forum for IR, please share.
>
> Otis
>
> --- Kelvin Tan wrote:
>
>> Wouldn't it be great if we could form a study group of Lucene folks who want to take the "next step"? I feel uneasy posting non-Lucene-specific questions to dev or user even if it's related to IR.
>>
>> Feels to me like there could be a couple like us, who didn't do a dissertation in IR, but would like a more in-depth knowledge for practical purposes. Basically, the end result is that we are able to tune or extend Lucene by using the Expert API (classes marked as Expert). Perhaps a possible outcome is a tuning tutorial for advanced users who already know how to use Lucene.
>>
>> What do you think?
>>
>> k
>>
>> On Sat, 5 Feb 2005 22:10:26 -0800 (PST), Otis Gospodnetic wrote:
>>
>>> Exactly. Luckily, since then I've learned a bit from lucene-dev discussions and side IR readings, so some of the topics are making more sense now.
>>>
>>> Otis
>>>
>>> --- Kelvin Tan wrote:
>>>
>>>> Hi Otis, I was re-reading this whole theoretical thread about idf, scoring, normalization, etc. from last Oct and couldn't help laughing out loud when I read your post, coz it summed up what I was thinking the whole time. I think it's really great to have people like Chuck and Paul (Elschot) around. I'm learning so much.
>>>>
>>>> k
>>>>
>>>> On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
>>>>
>>>>> Hi Chuck,
>>>>>
>>>>> The relative lack of responses is not because there is no interest, but probably because there are only a few people on lucene-dev who can follow/understand every detail of your proposal. I understand and hear you, but I have a hard time 'visualizing' some of the formulas in your proposal.
>>>>> What you are saying sounds right to me, but I don't have enough theoretical knowledge to go one way or the other.
>>>>>
>>>>> Otis
>>>>>
>>>>> --- Chuck Williams wrote:
>>>>>
>>>>>> Hi everybody,
>>>>>>
>>>>>> Although there doesn't seem to be much interest in this, I have one significant improvement to the below and a couple of observations that clarify the situation.
>>>>>>
>>>>>> To illustrate the problem better normalization is intended to address: in my current application, for BooleanQuery's of two terms I frequently get a top score of 1.0 when only one of the terms is matched. 1.0 should indicate a "perfect match". I'd like to set my UI up to present the results differently depending on how good the different results are (e.g., showing a visual indication of result quality, dropping the really bad results entirely, and segregating the good results from other only vaguely relevant results). To build this kind of "intelligence" into the UI, I certainly need to know whether my top result matched all query terms, or only half of them. I'd like to have the score tell me directly how good the matches are. The current normalization does not achieve this.
>>>>>>
>>>>>> The intent of a new normalization scheme is to preserve the current scoring in the following sense: the ratio of the nth result's score to the best result's score remains the same. Therefore, the only question is what normalization factor to apply to all scores. This reduces to a very specific question that determines the entire normalization -- what should the score of the best matching result be?
>>>>>>
>>>>>> The mechanism below has this property, i.e.
>>>>>> it keeps the current score ratios, except that I removed one idf term for reasons covered earlier (this better reflects the current empirically best scoring algorithms). However, removing an idf is a completely separate issue. The improved normalization is independent of whether or not that change is made.
>>>>>>
>>>>>> For the central question of what the top score should be, upon reflection, I don't like the definition below. It defined the top score as (approximately) the percentage of query terms matched by the top scoring result. A better conceptual definition is to use a weighted average based on the boosts, i.e., downward-propagate all boosts to the underlying terms (or phrases). Specifically, the "net boost" of a term is the product of the direct boost of the term and all boosts applied to encompassing clauses. Then the score of the top result becomes the sum of the net boosts of its matching terms divided by the sum of the net boosts of all query terms.
>>>>>>
>>>>>> This definition is a refinement of the original proposal below, and it could probably be further refined if some impact of the tf, idf and/or lengthNorm was desired in determining the top score. These other factors seem to be harder to normalize for, although I've thought of some simple approaches; e.g., assume the unmatched terms in the top result have values for these three factors that are the average of the values of the matched terms, then apply exactly the same concept of actual score divided by theoretical maximum score. That would eliminate any need to maintain maximum value statistics in the index.
>>>>>> As an example of the simple boost-based normalization, for the query ((a^2 b)^3 (c d^2)) the net boosts are:
>>>>>>
>>>>>> a --> 6
>>>>>> b --> 3
>>>>>> c --> 1
>>>>>> d --> 2
>>>>>>
>>>>>> So if a and b matched, but not c and d, in the top scoring result, its score would be 0.75. The normalizer would be 0.75 / (current score except for the current normalization). This normalizer would be applied to all current scores (minus normalization) to create the normalized scores.
>>>>>>
>>>>>> For the simple query (a b), if only one of the terms matched in the top result, then its score would be 0.5, vs. 1.0 or many other possible scores today.
>>>>>>
>>>>>> In addition to enabling more "intelligent" UIs that communicate the quality of results to end-users, the proposal below also extends the explain() mechanism to fully explain the final normalized score. However, that change is also independent -- it could be done with the current scoring.
>>>>>>
>>>>>> Am I the only one who would like to see better normalization in Lucene? Does anybody have a better approach?
>>>>>>
>>>>>> If you've read this far, thanks for indulging me on this. I would love to see this or something with similar properties in Lucene. I really just want to build my app, but as stated below I would write and contribute this if there is interest in putting it in, and nobody else wants to write it. Please let me know what you think one way or the other.
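[The net-boost arithmetic quoted above can be sketched in plain Java. This is only an illustration of the proposal's math; the class and method names are made up for the example and are not Lucene API.]

```java
import java.util.*;

// Sketch of the proposed "net boost" normalization: a term's net boost
// is its own boost times the boosts of all enclosing clauses, and the
// top result's score is
//   sum(net boosts of matched terms) / sum(net boosts of all terms).
public class NetBoostDemo {
    static double topScore(Map<String, Double> netBoosts, Set<String> matched) {
        double matchedSum = 0.0, totalSum = 0.0;
        for (Map.Entry<String, Double> e : netBoosts.entrySet()) {
            totalSum += e.getValue();
            if (matched.contains(e.getKey())) matchedSum += e.getValue();
        }
        return matchedSum / totalSum;
    }

    public static void main(String[] args) {
        // Query ((a^2 b)^3 (c d^2)): propagate enclosing boosts down.
        Map<String, Double> netBoosts = new LinkedHashMap<>();
        netBoosts.put("a", 2.0 * 3.0); // 6
        netBoosts.put("b", 1.0 * 3.0); // 3
        netBoosts.put("c", 1.0);       // 1
        netBoosts.put("d", 2.0);       // 2
        // Top result matched a and b but not c and d.
        double s = topScore(netBoosts, new HashSet<>(Arrays.asList("a", "b")));
        System.out.println(s); // 0.75, i.e. (6+3)/(6+3+1+2)
    }
}
```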
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Chuck
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Chuck Williams
>>>>>>> Sent: Monday, October 18, 2004 7:04 PM
>>>>>>> To: 'Lucene Developers List'
>>>>>>> Subject: RE: idf and explain(), was Re: Search and Scoring
>>>>>>>
>>>>>>> Doug Cutting wrote:
>>>>>>>> If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
>>>>>>>
>>>>>>> and:
>>>>>>>
>>>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
>>>>>>>
>>>>>>> I'm willing to write, test and contribute code to address the normalization issue, i.e. to yield scores in [0, 1] that are meaningful across searches. Unfortunately, this is considerably more involved than just optionally eliminating the current normalization in Hits. Before undertaking this, I'd like to see if there is agreement in principle that this is a good idea, and that my specific proposal below is the right way to go. Also, I'd like to make sure I've correctly inferred the constraints on writing code to be incorporated into Lucene.
>>>>>>> After looking at this in more detail, I agree that the cosine normalization is not the way to go, because of both efficiency and effectiveness considerations. A conceptual approach that would be efficient, relatively easy to implement, and seems to have at least reasonable behavior would be to define the top scoring match to have a score roughly equal to the percentage of query terms it matches (its "netCoord"). Scores below the top hit would be reduced based on the ratio of their raw scores to the raw score of the top hit (considering all of the current score factors, except that I'd like to remove one of the idf factors, as discussed earlier).
>>>>>>>
>>>>>>> For a couple of simple cases:
>>>>>>> a) the top match for a single-term query would always have a score of 1.0;
>>>>>>> b) the top scoring match for a BooleanQuery using DefaultSimilarity with all non-prohibited TermQuery clauses would have a score of m/n, where the hit matches m of the n terms.
>>>>>>>
>>>>>>> This isn't optimal, but seems much better than the current situation. Consider two single-term queries, s and t. If s matches more strongly than t in its top hit (e.g., a higher tf in a shorter field), it would be best if the top score of s was greater than the top score of t.
>>>>>>> But this kind of normalization would require keeping additional statistics that, so far as I know, are not currently in the index, like the maximum tf for terms and the minimum length for fields. These could be expensive to update on deletes. Also, normalizing by such factors could yield lower than subjectively reasonable scores in most cases, so it's not clear it would be better.
>>>>>>>
>>>>>>> The semantics above are at least clean, easy to understand, and support what seems to me the most important motivation to do this: allowing an application to use simple thresholding to segregate likely-to-be-relevant hits from likely-to-be-irrelevant hits.
>>>>>>>
>>>>>>> More specifically, for a BooleanQuery of TermQuery's the resulting score functions would be:
>>>>>>>
>>>>>>> BooleanQuery of TermQuery's: sbq = (tq1 ... tqn)
>>>>>>>
>>>>>>> baseScore(sbq, doc) = sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(tqi.term.field, doc)
>>>>>>>
>>>>>>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore
>>>>>>>
>>>>>>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
>>>>>>>
>>>>>>> score(sbq, doc) = rawScore * norm
>>>>>>>
>>>>>>> rawScores can be computed in the Scorer.score() methods and therefore used to sort the hits (e.g., in the instance method for collect() in the HitCollector in IndexSearcher.search()).
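[The four score functions quoted above can be worked through on toy numbers. In this sketch, each hit is represented by a per-term weight array standing in for boost*idf*tf*lengthNorm, with 0 for an unmatched term; the class is illustrative only, not Lucene code.]

```java
import java.util.*;

// Sketch of the proposed scoring for a BooleanQuery of TermQuerys:
//   baseScore = sum of per-term weights
//   rawScore  = coord * baseScore
//   norm      = 1 / (max baseScore over all hits)
//   score     = rawScore * norm
public class ProposedScoring {
    static double baseScore(double[] termWeights) {
        double sum = 0.0;
        for (double w : termWeights) sum += w;
        return sum;
    }

    static double coord(double[] termWeights) {
        int matched = 0;
        for (double w : termWeights) if (w > 0) matched++;
        return (double) matched / termWeights.length;
    }

    static double[] scores(double[][] hits) {
        double maxBase = 0.0;
        for (double[] h : hits) maxBase = Math.max(maxBase, baseScore(h));
        double[] out = new double[hits.length];
        for (int i = 0; i < hits.length; i++)
            out[i] = coord(hits[i]) * baseScore(hits[i]) / maxBase;
        return out;
    }

    public static void main(String[] args) {
        // Two-term query; doc 0 matches both terms, doc 1 only one.
        double[][] hits = { {0.4, 0.6}, {0.5, 0.0} };
        // doc 0: coord=1.0, base=1.0 -> score 1.0
        // doc 1: coord=0.5, base=0.5 -> score 0.25
        System.out.println(Arrays.toString(scores(hits)));
    }
}
```

Note how the top hit's score equals its coord, which is the property the proposal is after.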
>>>>>>> If the top scoring hit does not have the highest baseScore, then its score could be less than its coord; this seems desirable. These formulas imply that no result score can be larger than its coord, so if coord is well-defined (always between 0 and 1) then all results will be normalized between 0 and 1.
>>>>>>>
>>>>>>> In general, the netCoord, which takes the place of coord in the simple case above, needs to be defined for any query. Conceptually, this should be the total percentage of query terms matched by the document. It must be recursively computable from the subquery structure and overridable for specific Query types (e.g., to support specialized coords, like one that is always 1.0, as is useful in multi-field searching). Suitable default definitions for TermQuery and BooleanQuery are:
>>>>>>>
>>>>>>> TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise
>>>>>>>
>>>>>>> BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
>>>>>>>
>>>>>>> This is not quite the percentage of terms matched; e.g., consider a BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms and the other of which is just a term. However, it doesn't seem unreasonable for a BooleanQuery to state that its clauses are equally important, and this is consistent with the current coord behavior.
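[The recursive netCoord definition above, including the two-clause example where it differs from the plain percentage of terms matched, can be sketched as follows. The node classes are invented for the illustration and are not Lucene's Query classes.]

```java
// Sketch of the recursive netCoord definition: a term contributes 1.0
// if matched, and a BooleanQuery weights each clause's netCoord by 1/n.
interface QueryNode { double netCoord(java.util.Set<String> matchedTerms); }

class TermNode implements QueryNode {
    final String term;
    TermNode(String term) { this.term = term; }
    public double netCoord(java.util.Set<String> matched) {
        return matched.contains(term) ? 1.0 : 0.0;
    }
}

class BooleanNode implements QueryNode {
    final QueryNode[] clauses;
    BooleanNode(QueryNode... clauses) { this.clauses = clauses; }
    public double netCoord(java.util.Set<String> matched) {
        double sum = 0.0;
        for (QueryNode c : clauses) sum += c.netCoord(matched);
        return sum / clauses.length; // coord(1, n) = 1/n per clause
    }
}

public class NetCoordDemo {
    public static void main(String[] args) {
        // ((a b c) d): an inner clause of 3 terms plus a lone term.
        QueryNode q = new BooleanNode(
            new BooleanNode(new TermNode("a"), new TermNode("b"), new TermNode("c")),
            new TermNode("d"));
        java.util.Set<String> matched = new java.util.HashSet<>(
            java.util.Arrays.asList("a", "d"));
        // (1/3)*(1/2) + 1.0*(1/2) = 2/3, although only 2 of 4 terms matched.
        System.out.println(q.netCoord(matched));
    }
}
```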
>>>>>>> BooleanQuery.netCoord could be overridden for special cases, like the pure disjunction I use in my app for field expansions:
>>>>>>>
>>>>>>> MaxDisjunctionQuery(c1 ... cn).netCoord = max(ci) ci.netCoord
>>>>>>>
>>>>>>> Implementing this would proceed along these lines:
>>>>>>> 1. For backwards compatibility, add some kind of newScoring boolean setting.
>>>>>>> 2. Update all of these places to behave as indicated if newScoring:
>>>>>>>    a. Change Query.weight() to not do any normalization (no call to sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
>>>>>>>    b. Update all Query.weight classes to set their value according to the terms in the score formula above that don't involve the document (e.g., boost*idf for TermQuery).
>>>>>>>    c. Add suitable netCoord definitions to all Scorer classes.
>>>>>>>    d. Update all Scorer.score() methods to compute the rawScore as above.
>>>>>>>    e. Add the maximum baseScore as a field kept on TopDocs, computed in the HitCollectors.
>>>>>>>    f. Change the normalization in Hits to always divide every raw score by the maximum baseScore.
>>>>>>>    g. Update all of the current explain() methods to be consistent with this scoring, and to either report the rawScore, or to report the final score if the normalization factor is provided.
>>>>>>>    h. Add Hits.explain() (or, better, extend Searcher so that it keeps the Hits for use in Searcher.explain()) to call the new explain variation with the normalization factor so that final scores are fully explained.
>>>>>>>
>>>>>>> If this seems like a good idea, please let me know. I'm sure there are details I've missed that would come out during coding and testing. Also, the value of this is dependent on how reasonable the final scores look, which is hard to tell for sure until it is working.
>>>>>>>
>>>>>>> The coding standards for Lucene seem reasonably clear from the source code I've read. I could use just simple Java, so there shouldn't be any significant JVM dependencies. The above should be fully backward compatible due to the newScoring flag.
>>>>>>>
>>>>>>> On another note, I had to remove the German analyzer in my current 1.4.2 source configuration because GermanStemmer failed to compile due to what are apparently Unicode character constants that I've now got as illegal two-character character constants. This is presumably an encoding problem somewhere that I could track down. It's not important, but if the answer is obvious to any of you, I'd appreciate the quick tip.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Chuck
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Doug Cutting [mailto:cutting@apache.org]
>>>>>>>> Sent: Monday, October 18, 2004 9:44 AM
>>>>>>>> To: Lucene Developers List
>>>>>>>> Subject: Re: idf and explain(), was Re: Search and Scoring
>>>>>>>>
>>>>>>>> Chuck Williams wrote:
>>>>>>>>> That's a good point on how the standard vector space inner product similarity measure does imply that the idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other relevancy empirical measurements comparing this vs. a model that does not square idf?
>>>>>>>>
>>>>>>>> No, not that I know of.
>>>>>>>>
>>>>>>>>> Regarding normalization, the normalization in Hits does not have very nice properties. Due to the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest scoring result in any list that generates scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval independent of what search was done. This would allow, for example, sites to use a simple quality threshold to only show results that were "good enough". At my last company (I was President and head of engineering for InQuira), we found this to be important to many customers.
>>>>>>>>
>>>>>>>> If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
>>>>>>>>
>>>>>>>>> The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.:
>>>>>>>>>
>>>>>>>>> score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ) / sqrt[ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
>>>>>>>>>
>>>>>>>>> This makes the score a cosine, which, since the values are all positive, forces it to be in [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?
>>>>>>>>
>>>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
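[The cosine formula quoted above is just the dot product of the query and document weight vectors divided by the product of their Euclidean norms; a minimal sketch on toy weight vectors:]

```java
// Sketch of the standard vector-space cosine similarity: for
// non-negative weights the result is always in [0, 1].
public class CosineDemo {
    static double cosine(double[] q, double[] d) {
        double dot = 0.0, qNorm = 0.0, dNorm = 0.0;
        for (int t = 0; t < q.length; t++) {
            dot += q[t] * d[t];       // sum over t of weight(t,q)*weight(t,d)
            qNorm += q[t] * q[t];     // sum(t) weight(t,q)^2
            dNorm += d[t] * d[t];     // sum(t) weight(t,d)^2
        }
        return dot / Math.sqrt(qNorm * dNorm);
    }

    public static void main(String[] args) {
        double[] q = {1.0, 1.0, 0.0}; // term weights in the query
        double[] d = {2.0, 0.0, 1.0}; // term weights in a document
        System.out.println(cosine(q, d)); // 2/sqrt(10), about 0.632
    }
}
```

The expensive part Doug points out is the document norm `sum(t) weight(t,d)^2`, which shifts whenever global term statistics change.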
>>>>>>>>> Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization.
>>>>>>>>
>>>>>>>> In order to normalize scores to 1.0 one must know the maximum score. Explain only computes the score for a single document, and the maximum score is not known.
>>>>>>>>
>>>>>>>>> Although I think it would be best to have a normalization that would render scores comparable across searches.
>>>>>>>>
>>>>>>>> Then please submit a patch. Lucene doesn't change on its own.
>>>>>>>>
>>>>>>>> Doug
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org