From: Christoph Goller
Date: Thu, 21 Oct 2004 18:47:13 +0200
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
Subject: Re: idf and explain(), was Re: Search and Scoring
Message-ID: <4177E811.8050500@detego-software.de>
Chuck wrote:
> That's a good point on how the standard vector space inner product
> similarity measure does imply that the idf is squared relative to the
> document tf. Even having been aware of this formula for a long time,
> this particular implication never occurred to me.

The same holds for me :-)

Chuck wrote:
> I researched the idf^2 issue further and believe that empirical studies
> have consistently concluded that one idf factor should be dropped.
> Salton, the originator of the IR vector space model, decided to drop the
> idf term on documents in order to avoid the squaring. I hear he did
> this after studying recall and precision for many variations of his
> formula.

The cosine-distance and dot-product motivations should not be a dogma.
One can question the idea of using the same representation (as a vector
of terms) for a query and a document. The intuition behind not squaring
idf seems just as well-founded as the dot-product/cosine-distance
motivation. However, I also question empirical results here, since the
right scoring of documents is also very subjective :-)

Chuck wrote:
> Regarding normalization, the normalization in Hits does not have very
> nice properties. Due to the > 1.0 threshold check, it loses
> information, and it arbitrarily defines the highest-scoring result in
> any list that generates scores above 1.0 as a perfect match. It would
> be nice if score values were meaningful independent of searches, e.g.,
> if "0.6" meant the same quality of retrieval independent of what search
> was done. This would allow, for example, sites to use a simple
> quality threshold to only show results that were "good enough". At my
> last company (I was President and head of engineering for InQuira), we
> found this to be important to many customers.
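To make the idf^2 point concrete, here is a small self-contained sketch (not Lucene code; the tf and idf values are made up for illustration) showing that when both the query weight and the document weight carry an idf factor, a matching term contributes tf_q * tf_d * idf^2 to the dot product:

```java
// Illustrative sketch, not Lucene code: shows how idf enters the
// dot-product score twice when both the query weight and the document
// weight are computed as tf * idf. All numbers are invented.
public class IdfSquared {

    static double weight(double tf, double idf) {
        return tf * idf; // classic tf-idf term weight
    }

    public static void main(String[] args) {
        double tfQuery = 1.0;  // term frequency in the query
        double tfDoc   = 3.0;  // term frequency in the document
        double idf     = 2.0;  // inverse document frequency of the term

        // dot-product contribution of one matching term
        double contribution = weight(tfQuery, idf) * weight(tfDoc, idf);

        // the idf factor appears squared relative to the document tf
        System.out.println(contribution);                 // 12.0
        System.out.println(tfQuery * tfDoc * idf * idf);  // 12.0
    }
}
```

Dropping one idf factor, as Salton did, amounts to replacing one of the two weight() calls with plain tf.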
> The standard vector space similarity measure includes normalization by
> the product of the norms of the vectors, i.e.:
>
>   score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ) /
>       sqrt[ (sum over t of weight(t,q)^2) * (sum over t of weight(t,d)^2) ]
>
> This makes the score a cosine, which, since the values are all positive,
> forces it to be in [0, 1]. The sumOfSquares() normalization in Lucene
> does not fully implement this. Is there a specific reason for that?

You are right. The normalization of the documents' contribution is missing,
and there are the additional coord factors. The current implementation is a
mixture of dot-product and cosine-distance. Therefore, the additional
normalization in Hits is necessary if one wants to avoid scores > 1.0.

Doug wrote:
> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document
> each time a document is added to the collection, since 'weight(t,d)' is
> dependent on global term statistics. This is prohibitively expensive.

weight(t,d) is currently computed on the fly in every scorer. BooleanScorer
and ConjunctionScorer could collect the document normalization, and in the
end the searcher could apply the normalization. All scorers would have to
be extended to supply document normalization. Does this seem reasonable?

Chuck wrote:
> To illustrate the problem better normalization is intended to address,
> in my current application for BooleanQuerys of two terms, I frequently
> get a top score of 1.0 when only one of the terms is matched. 1.0
> should indicate a "perfect match". I'd like to set my UI up to present
> the results differently depending on how good the different results are
> (e.g., showing a visual indication of result quality, dropping the
> really bad results entirely, and segregating the good results from other
> only vaguely relevant results). To build this kind of "intelligence"
> into the UI, I certainly need to know whether my top result matched all
> query terms, or only half of them.
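As a sketch of what the fully normalized (cosine) score from the formula above looks like, assuming plain double[] term-weight vectors rather than anything from the Lucene API, the following also illustrates Chuck's two-term example: with cosine normalization, a document matching only one of two equally weighted query terms scores about 0.707, not 1.0.

```java
// Sketch of the cosine similarity from the formula above, operating on
// plain term-weight vectors (not Lucene classes). Because all weights
// are non-negative, the score always falls in [0, 1].
public class CosineScore {

    static double score(double[] queryWeights, double[] docWeights) {
        double dot = 0.0, qNorm = 0.0, dNorm = 0.0;
        for (int t = 0; t < queryWeights.length; t++) {
            dot   += queryWeights[t] * docWeights[t];
            qNorm += queryWeights[t] * queryWeights[t];
            dNorm += docWeights[t]   * docWeights[t];
        }
        if (qNorm == 0.0 || dNorm == 0.0) return 0.0; // empty vector
        return dot / Math.sqrt(qNorm * dNorm);
    }

    public static void main(String[] args) {
        double[] query = {1.0, 1.0};       // two query terms, equal weight

        double[] bothMatch = {2.0, 2.0};   // document matches both terms
        double[] oneMatch  = {2.0, 0.0};   // document matches only one

        System.out.println(score(query, bothMatch)); // 1.0, a perfect match
        System.out.println(score(query, oneMatch));  // ~0.707
    }
}
```

In Lucene itself, dNorm is the part that would have to be collected per document by the scorers, as discussed above.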
> I'd like to have the score tell me
> directly how good the matches are. The current normalization does not
> achieve this.
>
> The intent of a new normalization scheme is to preserve the current
> scoring in the following sense: the ratio of the nth result's score to
> the best result's score remains the same. Therefore, the only question
> is what normalization factor to apply to all scores. This reduces to a
> very specific question that determines the entire normalization -- what
> should the score of the best matching result be?

I would prefer your first idea with the cosine normalization, if an
efficient implementation is possible. As stated above, I currently think
it is possible.

Christoph

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org