From: David Spencer
Date: Mon, 07 Feb 2005 15:10:08 -0800
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
Subject: Re: Study Group (WAS Re: Normalized Scoring)
Message-ID: <4207F550.1030104@tropo.com>
In-Reply-To: <200526101416.488718@Kelvin>

You might want to see a post I just made to the thread with this long subject: "single field code ready - Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?"

I've done an example page that compares the results of searching with different query parsers and Similarities.

Kelvin Tan wrote:

> Wouldn't it be great if we could form a study group of Lucene folks who want to take the "next step"? I feel uneasy posting non-Lucene-specific questions to dev or user, even if they're related to IR.
>
> It feels to me like there could be a couple of people like us, who didn't do a dissertation in IR but would like a more in-depth knowledge for practical purposes. Basically, the end result is that we are able to tune or extend Lucene by using the expert API (classes marked as Expert). Perhaps a possible outcome is a tuning tutorial for advanced users who already know how to use Lucene.
>
> What do you think?
>
> k
>
> On Sat, 5 Feb 2005 22:10:26 -0800 (PST), Otis Gospodnetic wrote:
>
>> Exactly. Luckily, since then I've learned a bit from lucene-dev discussions and side IR readings, so some of the topics are making more sense now.
>>
>> Otis
>>
>> --- Kelvin Tan wrote:
>>
>>> Hi Otis, I was re-reading this whole theoretical thread about idf, scoring, normalization, etc. from last Oct and couldn't help laughing out loud when I read your post, coz it summed up what I was thinking the whole time.
>>> I think it's really great to have people like Chuck and Paul (Elschot) around. I'm learning so much.
>>>
>>> k
>>>
>>> On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
>>>
>>>> Hi Chuck,
>>>>
>>>> The relative lack of responses is not because there is no interest, but probably because there are only a few people on lucene-dev who can follow/understand every detail of your proposal. I understand and hear you, but I have a hard time 'visualizing' some of the formulas in your proposal. What you are saying sounds right to me, but I don't have enough theoretical knowledge to go one way or the other.
>>>>
>>>> Otis
>>>>
>>>> --- Chuck Williams wrote:
>>>>
>>>>> Hi everybody,
>>>>>
>>>>> Although there doesn't seem to be much interest in this, I have one significant improvement to the proposal below and a couple of observations that clarify the situation.
>>>>>
>>>>> To illustrate the problem better normalization is intended to address: in my current application, for BooleanQuery's of two terms I frequently get a top score of 1.0 when only one of the terms is matched. 1.0 should indicate a "perfect match". I'd like to set my UI up to present the results differently depending on how good the different results are (e.g., showing a visual indication of result quality, dropping the really bad results entirely, and segregating the good results from other, only vaguely relevant results). To build this kind of "intelligence" into the UI, I certainly need to know whether my top result matched all query terms, or only half of them. I'd like to have the score tell me directly how good the matches are. The current normalization does not achieve this.
>>>>>
>>>>> The intent of a new normalization scheme is to preserve the current scoring in the following sense: the ratio of the nth result's score to the best result's score remains the same. Therefore, the only question is what normalization factor to apply to all scores. This reduces to a very specific question that determines the entire normalization -- what should the score of the best matching result be?
>>>>>
>>>>> The mechanism below has this property, i.e. it keeps the current score ratios, except that I removed one idf term for reasons covered earlier (this better reflects the current empirically best scoring algorithms). However, removing an idf is a completely separate issue. The improved normalization is independent of whether or not that change is made.
>>>>>
>>>>> For the central question of what the top score should be, upon reflection, I don't like the definition below. It defined the top score as (approximately) the percentage of query terms matched by the top scoring result. A better conceptual definition is to use a weighted average based on the boosts, i.e., propagate all boosts down to the underlying terms (or phrases). Specifically, the "net boost" of a term is the product of the direct boost of the term and all boosts applied to encompassing clauses. Then the score of the top result becomes the sum of the net boosts of its matching terms divided by the sum of the net boosts of all query terms.
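To make the net-boost idea concrete, here is a rough Java sketch over a simplified query tree. The node types and helper below are hypothetical stand-ins for illustration, not Lucene's actual Query API:

    import java.util.*;

    // Hypothetical, simplified query tree -- not Lucene's Query classes.
    abstract class QueryNode {
        float boost = 1.0f;
    }

    class TermNode extends QueryNode {
        final String term;
        TermNode(String term, float boost) { this.term = term; this.boost = boost; }
    }

    class ClauseNode extends QueryNode {
        final List<QueryNode> children = new ArrayList<QueryNode>();
    }

    class NetBoost {
        // A term's net boost is its own boost times the boosts of all
        // enclosing clauses, computed by walking down the tree.
        static void collect(QueryNode node, float enclosing, Map<String, Float> out) {
            if (node instanceof TermNode) {
                TermNode t = (TermNode) node;
                out.put(t.term, enclosing * t.boost);
            } else {
                ClauseNode c = (ClauseNode) node;
                for (QueryNode child : c.children) {
                    collect(child, enclosing * c.boost, out);
                }
            }
        }

        // Top score = net boosts of matched terms / net boosts of all terms.
        static float topScore(Map<String, Float> netBoosts, Set<String> matched) {
            float matchedSum = 0f, totalSum = 0f;
            for (Map.Entry<String, Float> e : netBoosts.entrySet()) {
                totalSum += e.getValue();
                if (matched.contains(e.getKey())) matchedSum += e.getValue();
            }
            return totalSum == 0f ? 0f : matchedSum / totalSum;
        }
    }

For the query ((a^2 b)^3 (c d^2)) this walk yields net boosts a --> 6, b --> 3, c --> 1, d --> 2, matching the worked example further down.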
>>>>> This definition is a refinement of the original proposal below, and it could probably be further refined if some impact of the tf, idf and/or lengthNorm were desired in determining the top score. These other factors seem to be harder to normalize for, although I've thought of some simple approaches; e.g., assume the unmatched terms in the top result have values for these three factors that are the average of the values of the matched terms, then apply exactly the same concept of actual score divided by theoretical maximum score. That would eliminate any need to maintain maximum-value statistics in the index.
>>>>>
>>>>> As an example of the simple boost-based normalization, for the query ((a^2 b)^3 (c d^2)) the net boosts are: a --> 6, b --> 3, c --> 1, d --> 2.
>>>>>
>>>>> So if a and b matched, but not c and d, in the top scoring result, its score would be 0.75. The normalizer would be 0.75 / (current score except for the current normalization). This normalizer would be applied to all current scores (minus normalization) to create the normalized scores.
>>>>>
>>>>> For the simple query (a b), if only one of the terms matched in the top result, then its score would be 0.5, vs. 1.0 or many other possible scores today.
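Spelling out the arithmetic: matching a and b but not c and d gives (6 + 3) / (6 + 3 + 1 + 2) = 0.75 for the top hit, and every other hit is rescaled by the same factor. A minimal sketch, where the raw scores are made-up values:

    // Raw scores are hypothetical; only their ratio to the top raw score
    // survives normalization.
    float topScoreTarget = (6f + 3f) / (6f + 3f + 1f + 2f); // = 0.75
    float[] rawScores = { 4.2f, 2.1f, 0.7f };
    float normalizer = topScoreTarget / rawScores[0];
    for (int i = 0; i < rawScores.length; i++) {
        rawScores[i] *= normalizer; // 0.75, 0.375, 0.125
    }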
>>>>> In addition to enabling more "intelligent" UIs that communicate the quality of results to end-users, the proposal below also extends the explain() mechanism to fully explain the final normalized score. However, that change is also independent -- it could be done with the current scoring.
>>>>>
>>>>> Am I the only one who would like to see better normalization in Lucene? Does anybody have a better approach?
>>>>>
>>>>> If you've read this far, thanks for indulging me on this. I would love to see this or something with similar properties in Lucene. I really just want to build my app, but as stated below I would write and contribute this if there is interest in putting it in, and nobody else wants to write it. Please let me know what you think, one way or the other.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Chuck
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Chuck Williams
>>>>>> Sent: Monday, October 18, 2004 7:04 PM
>>>>>> To: 'Lucene Developers List'
>>>>>> Subject: RE: idf and explain(), was Re: Search and Scoring
>>>>>>
>>>>>> Doug Cutting wrote:
>>>>>>
>>>>>>> If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
>>>>>>
>>>>>> and:
>>>>>>
>>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
>>>>>>
>>>>>> I'm willing to write, test and contribute code to address the normalization issue, i.e. to yield scores in [0, 1] that are meaningful across searches. Unfortunately, this is considerably more involved than just optionally eliminating the current normalization in Hits. Before undertaking this, I'd like to see if there is agreement in principle that this is a good idea, and that my specific proposal below is the right way to go. Also, I'd like to make sure I've correctly inferred the constraints on writing code to be incorporated into Lucene.
>>>>>>
>>>>>> After looking at this in more detail, I agree that the cosine normalization is not the way to go, because of both efficiency and effectiveness considerations. A conceptual approach that would be efficient, relatively easy to implement, and seems to have at least reasonable behavior would be to define the top scoring match to have a score roughly equal to the percentage of query terms it matches (its "netCoord"). Scores below the top hit would be reduced based on the ratio of their raw scores to the raw score of the top hit (considering all of the current score factors, except that I'd like to remove one of the idf factors, as discussed earlier).
>>>>>>
>>>>>> For a couple of simple cases:
>>>>>> a) the top match for a single-term query would always have a score of 1.0,
>>>>>> b) the top scoring match for a BooleanQuery using DefaultSimilarity with all non-prohibited TermQuery clauses would have a score of m/n, where the hit matches m of the n terms.
>>>>>>
>>>>>> This isn't optimal, but seems much better than the current situation. Consider two single-term queries, s and t. If s matches more strongly than t in its top hit (e.g., a higher tf in a shorter field), it would be best if the top score of s was greater than the top score of t. But this kind of normalization would require keeping additional statistics that, so far as I know, are not currently in the index, like the maximum tf for terms and the minimum length for fields. These could be expensive to update on deletes. Also, normalizing by such factors could yield lower than subjectively reasonable scores in most cases, so it's not clear it would be better.
>>>>>>
>>>>>> The semantics above are at least clean, easy to understand, and support what seems to me the most important motivation to do this: allowing an application to use simple thresholding to segregate likely-to-be-relevant hits from likely-to-be-irrelevant hits.
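That thresholding payoff is easy to picture in application code. A minimal sketch, where the 0.5 cutoff and the helper are illustrative assumptions, not recommendations:

    // With scores guaranteed in [0, 1] and comparable across searches,
    // a fixed cutoff separates strong matches from marginal ones.
    static final float RELEVANT_CUTOFF = 0.5f; // hypothetical tuning choice

    static void present(float[] scores, String[] titles) {
        for (int i = 0; i < scores.length; i++) {
            String bucket = scores[i] >= RELEVANT_CUTOFF ? "[good]" : "[weak]";
            System.out.println(bucket + " " + titles[i] + " (" + scores[i] + ")");
        }
    }

Under the current normalization no fixed cutoff carries this meaning, since a top score of 1.0 can come from matching only half the query.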
>>>>>> More specifically, for a BooleanQuery of TermQuery's the resulting score functions would be:
>>>>>>
>>>>>> BooleanQuery of TermQuery's: sbq = (tq1 ... tqn)
>>>>>>
>>>>>> baseScore(sbq, doc) = sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(tqi.term.field, doc)
>>>>>>
>>>>>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore(sbq, doc)
>>>>>>
>>>>>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
>>>>>>
>>>>>> score(sbq, doc) = rawScore(sbq, doc) * norm(sbq, hits)
>>>>>>
>>>>>> rawScores can be computed in the Scorer.score() methods and therefore used to sort the hits (e.g., in the instance method for collect() in the HitCollector in IndexSearcher.search()). If the top scoring hit does not have the highest baseScore, then its score could be less than its coord; this seems desirable. These formulas imply that no result's score can be larger than its coord, so if coord is well-defined (always between 0 and 1) then all results will be normalized between 0 and 1.
>>>>>>
>>>>>> In general, the netCoord, which takes the place of coord in the simple case above, needs to be defined for any query. Conceptually, this should be the total percentage of query terms matched by the document. It must be recursively computable from the subquery structure and overridable for specific Query types (e.g., to support specialized coords, like one that is always 1.0, as is useful in multi-field searching). Suitable default definitions for TermQuery and BooleanQuery are:
>>>>>>
>>>>>> TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise
>>>>>>
>>>>>> BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
>>>>>>
>>>>>> This is not quite the percentage of terms matched; e.g., consider a BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms and the other of which is just a term. However, it doesn't seem unreasonable for a BooleanQuery to state that its clauses are equally important, and this is consistent with the current coord behavior. BooleanQuery.netCoord could be overridden for special cases, like the pure disjunction I use in my app for field expansions:
>>>>>>
>>>>>> MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord
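The recursive netCoord definitions translate almost directly into code. A rough sketch with hypothetical stand-in types (not Lucene's Scorer classes), using the fact that DefaultSimilarity defines coord(overlap, maxOverlap) as overlap/maxOverlap, so coord(1, n) = 1/n:

    import java.util.*;

    // netCoord: the fraction of the query a document matched,
    // computed recursively over the clause structure.
    interface CoordNode {
        float netCoord(Set<String> matchedTerms);
    }

    class TermCoord implements CoordNode {
        final String term;
        TermCoord(String term) { this.term = term; }
        public float netCoord(Set<String> matched) {
            return matched.contains(term) ? 1.0f : 0.0f; // TermQuery.netCoord
        }
    }

    class BooleanCoord implements CoordNode {
        final CoordNode[] clauses;
        BooleanCoord(CoordNode... clauses) { this.clauses = clauses; }
        public float netCoord(Set<String> matched) {
            float sum = 0f;
            for (CoordNode c : clauses) sum += c.netCoord(matched);
            return sum / clauses.length; // sum(ci) coord(1, n) * ci.netCoord
        }
    }

    class MaxDisjunctionCoord implements CoordNode {
        final CoordNode[] clauses;
        MaxDisjunctionCoord(CoordNode... clauses) { this.clauses = clauses; }
        public float netCoord(Set<String> matched) {
            float max = 0f;
            for (CoordNode c : clauses) max = Math.max(max, c.netCoord(matched));
            return max; // pure disjunction: the best clause wins
        }
    }

For the two-clause caveat above (a 3-term BooleanQuery plus a lone term), a document matching only the lone term gets netCoord (0 + 1)/2 = 0.5 rather than 1/4 -- exactly the "not quite the percentage of terms matched" behavior.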
>>>>>> Implementing this would proceed along these lines:
>>>>>>
>>>>>> 1. For backwards compatibility, add some kind of newScoring boolean setting.
>>>>>> 2. Update all of these places to behave as indicated if newScoring:
>>>>>>    a. Change Query.weight() to not do any normalization (no call to sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
>>>>>>    b. Update all Query.weight classes to set their value according to the terms in the score formula above that don't involve the document (e.g., boost*idf for TermQuery).
>>>>>>    c. Add suitable netCoord definitions to all Scorer classes.
>>>>>>    d. Update all Scorer.score() methods to compute the rawScore as above.
>>>>>>    e. Add the maximum baseScore as a field kept on TopDocs, computed in the HitCollectors.
>>>>>>    f. Change the normalization in Hits to always divide every raw score by the maximum baseScore.
>>>>>>    g. Update all of the current explain() methods to be consistent with this scoring, and to either report the rawScore, or to report the final score if the normalization factor is provided.
>>>>>>    h. Add Hits.explain() (or better, extend Searcher so that it keeps the Hits for use in Searcher.explain()) to call the new explain variation with the normalization factor so that final scores are fully explained.
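Steps e and f carry the mechanics of the scheme. A rough sketch of the collector-side bookkeeping -- simplified and hypothetical, not a patch against the real HitCollector or TopDocs:

    import java.util.*;

    class NormalizingCollector {
        private float maxBaseScore = 0f;
        private final List<Integer> docs = new ArrayList<Integer>();
        private final List<Float> rawScores = new ArrayList<Float>();

        // Called once per matching document (step e: track the max baseScore).
        void collect(int doc, float rawScore, float baseScore) {
            maxBaseScore = Math.max(maxBaseScore, baseScore);
            docs.add(doc);
            rawScores.add(rawScore);
        }

        // Step f: divide every raw score by the maximum baseScore. Since
        // rawScore = netCoord * baseScore and netCoord <= 1, every final
        // score lands in [0, 1].
        float score(int i) {
            return maxBaseScore == 0f ? 0f : rawScores.get(i) / maxBaseScore;
        }
    }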
Do >>>>>>>> you >>>>>>>> >>>>> >>>>> know >>>>> if >>>>> >>>>>>>> anybody has done precision/recall or other relevancy >>>>>>>> >>>>> >>>>> empirical >>>>> >>>>>>>> measurements comparing this vs. a model that does not >>>>>>>> >>>>> >>>>> square >>>>> idf? >>>>> >>>>> >>>>>>> No, not that I know of. >>>>>>> >>>>>>> >>>>>>>> Regarding normalization, the normalization in Hits does >>>>>>>> not >>>>>>>> >>>>> >>>>> have >>>>> >>>>>> very >>>>>> >>>>>>>> nice properties. Due to the > 1.0 threshold check, it >>>>>>>> >>>>> >>>>> loses >>>>> >>>>>>>> information, and it arbitrarily defines the highest >>>>>>>> scoring >>>>>>>> >>>>> >>>>> result >>>>> >>>>>> in >>>>>> >>>>>>>> any list that generates scores above 1.0 as a perfect >>>>>>>> >>>>> >>>>> match. >>>>> It >>>>> >>>>>> would >>>>>> >>>>>>>> be nice if score values were meaningful independent of >>>>>>>> >>>>> >>>>> searches, >>>>> >>>>>> e.g., >>>>>> >>>>>>>> if "0.6" meant the same quality of retrieval >>>>>>>> independent of >>>>>>>> >>>>> >>>>> what >>>>> >>>>>>> search >>>>>>> >>>>>>>> was done. This would allow, for example, sites to use >>>>>>>> a a >>>>>>>> >>>>> >>>>> simple >>>>> >>>>>>>> quality threshold to only show results that were "good >>>>>>>> >>>>> >>>>> enough". >>>>> >>>>>> At my >>>>>> >>>>>>>> last company (I was President and head of engineering >>>>>>>> for >>>>>>>> >>>>> >>>>> InQuira), >>>>> >>>>>> we >>>>>> >>>>>>>> found this to be important to many customers. >>>>>>>> >>>>>>> >>>>>>> If this is a big issue for you, as it seems it is, please >>>>>>> >>>>> >>>>> submit >>>>> a >>>>> >>>>>> patch >>>>>> >>>>>>> to optionally disable score normalization in Hits.java. >>>>>>> >>>>>>> >>>>>>>> The standard vector space similarity measure includes >>>>>>>> >>>>>> >>>>>> normalization by >>>>>> >>>>>>>> the product of the norms of the vectors, i.e.: >>>>>>>> >>>>>>>> score(d,q) = sum over t of ( weight(t,q) * weight(t,d) >>>>>>>> ) >>>>>>>> >>>>> >>>>> / >>>>> >>>>>>>> sqrt [ (sum(t) weight(t,q)^2) * (sum(t) >>>>>>>> >>>>>>> >>>>>>> weight(t,d)^2) ] >>>>>>> >>>>>>> >>>>>>>> This makes the score a cosine, which since the values >>>>>>>> are >>>>>>>> >>>>> >>>>> all >>>>> >>>>>> positive, >>>>>> >>>>>>>> forces it to be in [0, 1]. The sumOfSquares() >>>>>>>> >>>>> >>>>> normalization >>>>> in >>>>> >>>>>> Lucene >>>>>> >>>>>>>> does not fully implement this. Is there a specific >>>>>>>> reason >>>>>>>> >>>>> >>>>> for >>>>> >>>>>> that? >>>>>> >>>>>> >>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for >>>>>>> >>>>> >>>>> each >>>>> >>>>>> document >>>>>> >>>>>>> each time a document is added to the collection, since >>>>>>> >>>>> >>>>> 'weight(t,d)' >>>>> >>>>>> is >>>>>> >>>>>>> dependent on global term statistics. This is prohibitivly >>>>>>> >>>>> >>>>> expensive. >>>>> >>>>>>> Research has also demonstrated that such cosine >>>>>>> normalization >>>>>>> >>>>> >>>>> gives >>>>> >>>>>>> somewhat inferior results (e.g., Singhal's pivoted length >>>>>>> >>>>>> >>>>>> normalization). >>>>>> >>>>>> >>>>>>>> Re. explain(), I don't see a downside to extending it >>>>>>>> show >>>>>>>> >>>>> >>>>> the >>>>> >>>>>> final >>>>>> >>>>>>>> normalization in Hits. It could still show the raw >>>>>>>> score >>>>>>>> >>>>> >>>>> just >>>>> >>>>>> prior >>>>>> >>>>>>> to >>>>>>> >>>>>>>> that normalization. >>>>>>>> >>>>>>> >>>>>>> In order to normalize scores to 1.0 one must know the >>>>>>> maximum >>>>>>> >>>>> >>>>> score. 
>>>>>>>> Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization.
>>>>>>>
>>>>>>> In order to normalize scores to 1.0 one must know the maximum score. Explain only computes the score for a single document, and the maximum score is not known.
>>>>>>>
>>>>>>>> Although I think it would be best to have a normalization that would render scores comparable across searches.
>>>>>>>
>>>>>>> Then please submit a patch. Lucene doesn't change on its own.
>>>>>>>
>>>>>>> Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org