From: Kelvin Tan
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Date: Mon, 7 Feb 2005 19:16:19 +0100
Subject: RE: Study Group (WAS Re: Normalized Scoring)
People!!! Thanks for sharing, but this is what the wiki is for!! Will everyone kindly add their posts to http://wiki.apache.org/jakarta-lucene/InformationRetrieval for posterity?

k

On Mon, 7 Feb 2005 12:48:02 -0500, Joaquin Delgado wrote:
> A very solid (and free) online course on "intelligent information retrieval" with a focus on practical issues can be found on Prof. Mooney's (Univ. of Texas) site: http://www.cs.utexas.edu/users/mooney/ir-course/
>
> I've also copied two interesting papers (from my own private library -- if you are interested I've got much more to offer) for your reading:
>
> For those looking for an answer to the unsolved mystery in IR:
> http://www.triplehop.com/pdf/How_Many_Relevances_in_IR.pdf
>
> For those interested in extending IR systems with Statistical NLP and Machine Learning:
> http://www.triplehop.com/pdf/Text_Representation_and_ML_General_Concepts.pdf
>
> BTW, if someone has had a look at http://www.find.com (meta-search, indexing and concept clustering system) I'd be interested in their opinion.
>
> -- Joaquin
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Monday, February 07, 2005 12:15 PM
> To: Lucene Developers List
> Subject: Re: Study Group (WAS Re: Normalized Scoring)
>
> I think I see what you are after. I'm after the same knowledge. :)
>
> The only things that I can recommend are books:
> Modern Information Retrieval
> Managing Gigabytes
>
> And online resources like:
> http://finance.groups.yahoo.com/group/mg/ (note the weird host name)
> http://www.sims.berkeley.edu/~hearst/irbook/
>
> There is a pile of stuff in Citeseer, but those papers never really dig into the details and always require solid previous knowledge of the field. They are no replacement for a class or a textbook.
> If you find a good forum for IR, please share.
>
> Otis
>
> --- Kelvin Tan wrote:
>
>> Wouldn't it be great if we could form a study group of Lucene folks who want to take the "next step"? I feel uneasy posting non-Lucene-specific questions to dev or user even if it's related to IR.
>>
>> Feels to me like there could be a couple like us, who didn't do a dissertation in IR, but would like a more in-depth knowledge for practical purposes. Basically, the end result is that we are able to tune or extend Lucene by using the Expert API (classes marked as Expert). Perhaps a possible outcome is a tuning tutorial for advanced users who already know how to use Lucene.
>>
>> What do you think?
>>
>> k
>>
>> On Sat, 5 Feb 2005 22:10:26 -0800 (PST), Otis Gospodnetic wrote:
>>
>>> Exactly. Luckily, since then I've learned a bit from lucene-dev discussions and side IR readings, so some of the topics are making more sense now.
>>>
>>> Otis
>>>
>>> --- Kelvin Tan wrote:
>>>
>>>> Hi Otis, I was re-reading this whole theoretical thread about idf, scoring, normalization, etc. from last Oct and couldn't help laughing out loud when I read your post, coz it summed up what I was thinking the whole time. I think it's really great to have people like Chuck and Paul (Elschot) around. I'm learning so much.
>>>>
>>>> k
>>>>
>>>> On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
>>>>
>>>>> Hi Chuck,
>>>>>
>>>>> The relative lack of responses is not because there is no interest, but probably because there are only a few people on lucene-dev who can follow/understand every detail of your proposal. I understand and hear you, but I have a hard time 'visualizing' some of the formulas in your proposal.
>>>>> What you are saying sounds right to me, but I don't have enough theoretical knowledge to go one way or the other.
>>>>>
>>>>> Otis
>>>>>
>>>>> --- Chuck Williams wrote:
>>>>>
>>>>>> Hi everybody,
>>>>>>
>>>>>> Although there doesn't seem to be much interest in this, I have one significant improvement to the below and a couple of observations that clarify the situation.
>>>>>>
>>>>>> To illustrate the problem better normalization is intended to address: in my current application, for BooleanQuery's of two terms I frequently get a top score of 1.0 when only one of the terms is matched. 1.0 should indicate a "perfect match". I'd like to set my UI up to present the results differently depending on how good the different results are (e.g., showing a visual indication of result quality, dropping the really bad results entirely, and segregating the good results from other only vaguely relevant results). To build this kind of "intelligence" into the UI, I certainly need to know whether my top result matched all query terms, or only half of them. I'd like to have the score tell me directly how good the matches are. The current normalization does not achieve this.
>>>>>>
>>>>>> The intent of a new normalization scheme is to preserve the current scoring in the following sense: the ratio of the nth result's score to the best result's score remains the same. Therefore, the only question is what normalization factor to apply to all scores. This reduces to a very specific question that determines the entire normalization -- what should the score of the best matching result be?
>>>>>>
>>>>>> The mechanism below has this property, i.e.
>>>>>> it keeps the current score ratios, except that I removed one idf term for reasons covered earlier (this better reflects the current empirically best scoring algorithms). However, removing an idf is a completely separate issue. The improved normalization is independent of whether or not that change is made.
>>>>>>
>>>>>> For the central question of what the top score should be, upon reflection, I don't like the definition below. It defined the top score as (approximately) the percentage of query terms matched by the top scoring result. A better conceptual definition is to use a weighted average based on the boosts, i.e., downward-propagate all boosts to the underlying terms (or phrases). Specifically, the "net boost" of a term is the product of the direct boost of the term and all boosts applied to encompassing clauses. Then the score of the top result becomes the sum of the net boosts of its matching terms divided by the sum of the net boosts of all query terms.
>>>>>>
>>>>>> This definition is a refinement of the original proposal below, and it could probably be further refined if some impact of the tf, idf and/or lengthNorm was desired in determining the top score. These other factors seem to be harder to normalize for, although I've thought of some simple approaches; e.g., assume the unmatched terms in the top result have values for these three factors that are the average of the values of the matched terms, then apply exactly the same concept of actual score divided by theoretical maximum score. That would eliminate any need to maintain maximum value statistics in the index.
>>>>>> As an example of the simple boost-based normalization, for the query ((a^2 b)^3 (c d^2)) the net boosts are:
>>>>>>
>>>>>> a --> 6
>>>>>> b --> 3
>>>>>> c --> 1
>>>>>> d --> 2
>>>>>>
>>>>>> So if a and b matched, but not c and d, in the top scoring result, its score would be 0.75. The normalizer would be 0.75 / (current score except for the current normalization). This normalizer would be applied to all current scores (minus normalization) to create the normalized scores.
>>>>>>
>>>>>> For the simple query (a b), if only one of the terms matched in the top result, then its score would be 0.5, vs. 1.0 or many other possible scores today.
>>>>>>
>>>>>> In addition to enabling more "intelligent" UIs that communicate the quality of results to end-users, the proposal below also extends the explain() mechanism to fully explain the final normalized score. However, that change is also independent -- it could be done with the current scoring.
>>>>>>
>>>>>> Am I the only one who would like to see better normalization in Lucene? Does anybody have a better approach?
>>>>>>
>>>>>> If you've read this far, thanks for indulging me on this. I would love to see this or something with similar properties in Lucene. I really just want to build my app, but as stated below I would write and contribute this if there is interest in putting it in, and nobody else wants to write it. Please let me know what you think one way or the other.
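[The net-boost arithmetic quoted above can be sketched in plain Java. This is only an illustration of the proposal's math; the class and method names are made up for the example and are not Lucene API.]

```java
import java.util.*;

// Sketch of the proposed "net boost" normalization: a term's net boost
// is its own boost times the boosts of all enclosing clauses, and the
// top result's score is
//   sum(net boosts of matched terms) / sum(net boosts of all terms).
public class NetBoostDemo {
    static double topScore(Map<String, Double> netBoosts, Set<String> matched) {
        double matchedSum = 0.0, totalSum = 0.0;
        for (Map.Entry<String, Double> e : netBoosts.entrySet()) {
            totalSum += e.getValue();
            if (matched.contains(e.getKey())) matchedSum += e.getValue();
        }
        return matchedSum / totalSum;
    }

    public static void main(String[] args) {
        // Query ((a^2 b)^3 (c d^2)): propagate enclosing boosts down.
        Map<String, Double> netBoosts = new LinkedHashMap<>();
        netBoosts.put("a", 2.0 * 3.0); // 6
        netBoosts.put("b", 1.0 * 3.0); // 3
        netBoosts.put("c", 1.0);       // 1
        netBoosts.put("d", 2.0);       // 2
        // Top result matched a and b but not c and d.
        double s = topScore(netBoosts, new HashSet<>(Arrays.asList("a", "b")));
        System.out.println(s); // 0.75, i.e. (6+3)/(6+3+1+2)
    }
}
```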
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Chuck
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Chuck Williams
>>>>>>> Sent: Monday, October 18, 2004 7:04 PM
>>>>>>> To: 'Lucene Developers List'
>>>>>>> Subject: RE: idf and explain(), was Re: Search and Scoring
>>>>>>>
>>>>>>> Doug Cutting wrote:
>>>>>>>> If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
>>>>>>>
>>>>>>> and:
>>>>>>>
>>>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
>>>>>>>
>>>>>>> I'm willing to write, test and contribute code to address the normalization issue, i.e. to yield scores in [0, 1] that are meaningful across searches. Unfortunately, this is considerably more involved than just optionally eliminating the current normalization in Hits. Before undertaking this, I'd like to see if there is agreement in principle that this is a good idea, and that my specific proposal below is the right way to go. Also, I'd like to make sure I've correctly inferred the constraints on writing code to be incorporated into Lucene.
>>>>>>> After looking at this in more detail, I agree that the cosine normalization is not the way to go, because of both efficiency and effectiveness considerations. A conceptual approach that would be efficient, relatively easy to implement, and seems to have at least reasonable behavior would be to define the top scoring match to have a score roughly equal to the percentage of query terms it matches (its "netCoord"). Scores below the top hit would be reduced based on the ratio of their raw scores to the raw score of the top hit (considering all of the current score factors, except that I'd like to remove one of the idf factors, as discussed earlier).
>>>>>>>
>>>>>>> For a couple of simple cases:
>>>>>>> a) the top match for a single-term query would always have a score of 1.0;
>>>>>>> b) the top scoring match for a BooleanQuery using DefaultSimilarity with all non-prohibited TermQuery clauses would have a score of m/n, where the hit matches m of the n terms.
>>>>>>>
>>>>>>> This isn't optimal, but seems much better than the current situation. Consider two single-term queries, s and t. If s matches more strongly than t in its top hit (e.g., a higher tf in a shorter field), it would be best if the top score of s was greater than the top score of t.
>>>>>>> But this kind of normalization would require keeping additional statistics that, so far as I know, are not currently in the index, like the maximum tf for terms and the minimum length for fields. These could be expensive to update on deletes. Also, normalizing by such factors could yield lower than subjectively reasonable scores in most cases, so it's not clear it would be better.
>>>>>>>
>>>>>>> The semantics above are at least clean, easy to understand, and support what seems to me the most important motivation to do this: allowing an application to use simple thresholding to segregate likely-to-be-relevant hits from likely-to-be-irrelevant hits.
>>>>>>>
>>>>>>> More specifically, for a BooleanQuery of TermQuery's the resulting score functions would be:
>>>>>>>
>>>>>>> BooleanQuery of TermQuery's: sbq = (tq1 ... tqn)
>>>>>>>
>>>>>>> baseScore(sbq, doc) = sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(tqi.term.field, doc)
>>>>>>>
>>>>>>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore
>>>>>>>
>>>>>>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
>>>>>>>
>>>>>>> score(sbq, doc) = rawScore * norm
>>>>>>>
>>>>>>> rawScores can be computed in the Scorer.score() methods and therefore used to sort the hits (e.g., in the instance method for collect() in the HitCollector in IndexSearcher.search()).
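[The four score functions quoted above can be worked through on toy numbers. In this sketch, each hit is represented by a per-term weight array standing in for boost*idf*tf*lengthNorm, with 0 for an unmatched term; the class is illustrative only, not Lucene code.]

```java
import java.util.*;

// Sketch of the proposed scoring for a BooleanQuery of TermQuerys:
//   baseScore = sum of per-term weights
//   rawScore  = coord * baseScore
//   norm      = 1 / (max baseScore over all hits)
//   score     = rawScore * norm
public class ProposedScoring {
    static double baseScore(double[] termWeights) {
        double sum = 0.0;
        for (double w : termWeights) sum += w;
        return sum;
    }

    static double coord(double[] termWeights) {
        int matched = 0;
        for (double w : termWeights) if (w > 0) matched++;
        return (double) matched / termWeights.length;
    }

    static double[] scores(double[][] hits) {
        double maxBase = 0.0;
        for (double[] h : hits) maxBase = Math.max(maxBase, baseScore(h));
        double[] out = new double[hits.length];
        for (int i = 0; i < hits.length; i++)
            out[i] = coord(hits[i]) * baseScore(hits[i]) / maxBase;
        return out;
    }

    public static void main(String[] args) {
        // Two-term query; doc 0 matches both terms, doc 1 only one.
        double[][] hits = { {0.4, 0.6}, {0.5, 0.0} };
        // doc 0: coord=1.0, base=1.0 -> score 1.0
        // doc 1: coord=0.5, base=0.5 -> score 0.25
        System.out.println(Arrays.toString(scores(hits)));
    }
}
```

Note how the top hit's score equals its coord, which is the property the proposal is after.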
>>>>>>> If the top scoring hit does not have the highest baseScore, then its score could be less than its coord; this seems desirable. These formulas imply that no result score can be larger than its coord, so if coord is well-defined (always between 0 and 1) then all results will be normalized between 0 and 1.
>>>>>>>
>>>>>>> In general, the netCoord, which takes the place of coord in the simple case above, needs to be defined for any query. Conceptually, this should be the total percentage of query terms matched by the document. It must be recursively computable from the subquery structure and overridable for specific Query types (e.g., to support specialized coords, like one that is always 1.0, as is useful in multi-field searching). Suitable default definitions for TermQuery and BooleanQuery are:
>>>>>>>
>>>>>>> TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise
>>>>>>>
>>>>>>> BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
>>>>>>>
>>>>>>> This is not quite the percentage of terms matched; e.g., consider a BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms and the other of which is just a term. However, it doesn't seem unreasonable for a BooleanQuery to state that its clauses are equally important, and this is consistent with the current coord behavior.
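[The recursive netCoord definition above, including the two-clause example where it differs from the plain percentage of terms matched, can be sketched as follows. The node classes are invented for the illustration and are not Lucene's Query classes.]

```java
// Sketch of the recursive netCoord definition: a term contributes 1.0
// if matched, and a BooleanQuery weights each clause's netCoord by 1/n.
interface QueryNode { double netCoord(java.util.Set<String> matchedTerms); }

class TermNode implements QueryNode {
    final String term;
    TermNode(String term) { this.term = term; }
    public double netCoord(java.util.Set<String> matched) {
        return matched.contains(term) ? 1.0 : 0.0;
    }
}

class BooleanNode implements QueryNode {
    final QueryNode[] clauses;
    BooleanNode(QueryNode... clauses) { this.clauses = clauses; }
    public double netCoord(java.util.Set<String> matched) {
        double sum = 0.0;
        for (QueryNode c : clauses) sum += c.netCoord(matched);
        return sum / clauses.length; // coord(1, n) = 1/n per clause
    }
}

public class NetCoordDemo {
    public static void main(String[] args) {
        // ((a b c) d): an inner clause of 3 terms plus a lone term.
        QueryNode q = new BooleanNode(
            new BooleanNode(new TermNode("a"), new TermNode("b"), new TermNode("c")),
            new TermNode("d"));
        java.util.Set<String> matched = new java.util.HashSet<>(
            java.util.Arrays.asList("a", "d"));
        // (1/3)*(1/2) + 1.0*(1/2) = 2/3, although only 2 of 4 terms matched.
        System.out.println(q.netCoord(matched));
    }
}
```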
>>>>>>> BooleanQuery.netCoord could be overridden for special cases, like the pure disjunction I use in my app for field expansions:
>>>>>>>
>>>>>>> MaxDisjunctionQuery(c1 ... cn).netCoord = max(ci) ci.netCoord
>>>>>>>
>>>>>>> Implementing this would proceed along these lines:
>>>>>>> 1. For backwards compatibility, add some kind of newScoring boolean setting.
>>>>>>> 2. Update all of these places to behave as indicated if newScoring:
>>>>>>>    a. Change Query.weight() to not do any normalization (no call to sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
>>>>>>>    b. Update all Query.weight classes to set their value according to the terms in the score formula above that don't involve the document (e.g., boost*idf for TermQuery).
>>>>>>>    c. Add suitable netCoord definitions to all Scorer classes.
>>>>>>>    d. Update all Scorer.score() methods to compute the rawScore as above.
>>>>>>>    e. Add the maximum baseScore as a field kept on TopDocs, computed in the HitCollectors.
>>>>>>>    f. Change the normalization in Hits to always divide every raw score by the maximum baseScore.
>>>>>>>    g. Update all of the current explain() methods to be consistent with this scoring, and to either report the rawScore, or to report the final score if the normalization factor is provided.
>>>>>>>    h. Add Hits.explain() (or, better, extend Searcher so that it keeps the Hits for use in Searcher.explain()) to call the new explain variation with the normalization factor so that final scores are fully explained.
>>>>>>>
>>>>>>> If this seems like a good idea, please let me know. I'm sure there are details I've missed that would come out during coding and testing. Also, the value of this is dependent on how reasonable the final scores look, which is hard to tell for sure until it is working.
>>>>>>>
>>>>>>> The coding standards for Lucene seem reasonably clear from the source code I've read. I could use just simple Java, so there shouldn't be any significant JVM dependencies. The above should be fully backward compatible due to the newScoring flag.
>>>>>>>
>>>>>>> On another note, I had to remove the German analyzer in my current 1.4.2 source configuration because GermanStemmer failed to compile due to what are apparently Unicode character constants that I've now got as illegal two-character character constants. This is presumably an encoding problem somewhere that I could track down. It's not important, but if the answer is obvious to any of you, I'd appreciate the quick tip.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Chuck
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Doug Cutting [mailto:cutting@apache.org]
>>>>>>>> Sent: Monday, October 18, 2004 9:44 AM
>>>>>>>> To: Lucene Developers List
>>>>>>>> Subject: Re: idf and explain(), was Re: Search and Scoring
>>>>>>>>
>>>>>>>> Chuck Williams wrote:
>>>>>>>>> That's a good point on how the standard vector space inner product similarity measure does imply that the idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other relevancy empirical measurements comparing this vs. a model that does not square idf?
>>>>>>>>
>>>>>>>> No, not that I know of.
>>>>>>>>
>>>>>>>>> Regarding normalization, the normalization in Hits does not have very nice properties. Due to the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest scoring result in any list that generates scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval independent of what search was done. This would allow, for example, sites to use a simple quality threshold to only show results that were "good enough". At my last company (I was President and head of engineering for InQuira), we found this to be important to many customers.
>>>>>>>>
>>>>>>>> If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
>>>>>>>>
>>>>>>>>> The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.:
>>>>>>>>>
>>>>>>>>> score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ) / sqrt[ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
>>>>>>>>>
>>>>>>>>> This makes the score a cosine, which, since the values are all positive, forces it to be in [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?
>>>>>>>>
>>>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
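[The cosine formula quoted above is just the dot product of the query and document weight vectors divided by the product of their Euclidean norms; a minimal sketch on toy weight vectors:]

```java
// Sketch of the standard vector-space cosine similarity: for
// non-negative weights the result is always in [0, 1].
public class CosineDemo {
    static double cosine(double[] q, double[] d) {
        double dot = 0.0, qNorm = 0.0, dNorm = 0.0;
        for (int t = 0; t < q.length; t++) {
            dot += q[t] * d[t];       // sum over t of weight(t,q)*weight(t,d)
            qNorm += q[t] * q[t];     // sum(t) weight(t,q)^2
            dNorm += d[t] * d[t];     // sum(t) weight(t,d)^2
        }
        return dot / Math.sqrt(qNorm * dNorm);
    }

    public static void main(String[] args) {
        double[] q = {1.0, 1.0, 0.0}; // term weights in the query
        double[] d = {2.0, 0.0, 1.0}; // term weights in a document
        System.out.println(cosine(q, d)); // 2/sqrt(10), about 0.632
    }
}
```

The expensive part Doug points out is the document norm `sum(t) weight(t,d)^2`, which shifts whenever global term statistics change.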
>>>>>>>>> Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization.
>>>>>>>>
>>>>>>>> In order to normalize scores to 1.0 one must know the maximum score. Explain only computes the score for a single document, and the maximum score is not known.
>>>>>>>>
>>>>>>>>> Although I think it would be best to have a normalization that would render scores comparable across searches.
>>>>>>>>
>>>>>>>> Then please submit a patch. Lucene doesn't change on its own.
>>>>>>>>
>>>>>>>> Doug
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org