Message-ID: <4B071AD4.8040300@gmail.com>
Date: Fri, 20 Nov 2009 17:40:20 -0500
From: Mark Miller
To: java-dev@lucene.apache.org
Subject: Re: Whither Query Norm?
In-Reply-To: <4b124c310911201431k28da961bse2c2755fa7b26505@mail.gmail.com>

Yes, it's a good point. I'm coming at it from a more pure angle. And I'm
not so elegant in my thought patterns :)

Right though - our document vector normalization is - uh - quick and
dirty :) It's about the cheapest one I've seen other than root(length).

I don't think that scores between queries are very comparable in general
in Lucene either - but they would be even less so if we dropped the query
norm. As I've argued in the past - if it had any real perf hit, I'd be on
the side of dropping it - but from what I can see, it really doesn't, so
I don't see why we should further skew the scores.
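
A minimal sketch of the point above, applied to the single-term example
quoted below. It assumes the classic default similarity (tf = sqrt(freq),
idf = 1 + ln(numDocs / (docFreq + 1)), lengthNorm = 1/sqrt(numTerms),
queryNorm = 1/sqrt(sum of squared idf weights)) and ignores boosts and
coord. The function names and the document counts are made up for
illustration; this is not Lucene code.

```python
import math

def idf(doc_freq, num_docs):
    # Assumed classic idf: 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def score(query_terms, doc_terms, idfs, use_query_norm=True):
    # Cheap document normalization: 1 / sqrt(number of terms in the doc)
    length_norm = 1.0 / math.sqrt(len(doc_terms))
    # idf is applied twice: once as the query weight, once as the doc weight
    total = sum(
        math.sqrt(doc_terms.count(t)) * idfs[t] ** 2 * length_norm
        for t in query_terms
    )
    if use_query_norm:
        # Divide by the magnitude of the query vector
        total /= math.sqrt(sum(idfs[t] ** 2 for t in query_terms))
    return total

# Hypothetical collection: TERM1 is rare, TERM2 is common
idfs = {"TERM1": idf(1, 1000), "TERM2": idf(500, 1000)}

with_norm = (
    score(["TERM1"], ["TERM1"], idfs),
    score(["TERM2"], ["TERM2"], idfs),
)
without_norm = (
    score(["TERM1"], ["TERM1"], idfs, use_query_norm=False),
    score(["TERM2"], ["TERM2"], idfs, use_query_norm=False),
)

print("with queryNorm:   ", with_norm)
print("without queryNorm:", without_norm)
```

Under these assumptions, each perfect match scores idf(term) with the
query norm and idf(term)^2 without it, so the gap between the two queries
grows from idf(TERM1)/idf(TERM2) to the square of that ratio when the
query norm is dropped. Ranking within a single query is unaffected, since
the query norm is the same constant for every document.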

Jake Mannix wrote:
> Remember: we're not really doing cosine at all here. The factor of IDF^2
> on the top, with the factor of 1/sqrt(numTermsInDocument) on the bottom
> couples together to end up with the following effect:
>
>   q1 = "TERM1"
>   q2 = "TERM2"
>
>   doc1 = "TERM1"
>   doc2 = "TERM2"
>
>   score(q1, doc1) = idf(TERM1)
>   score(q2, doc2) = idf(TERM2)
>
> Both are perfect matches, but one scores higher (possibly much higher)
> than the other.
>
> Boosts work just fine with cosine (it's just a way of putting "tf" into
> the query side as well as in the document side), but normalizing
> documents without taking the idf of terms in the document into
> consideration blows away the ability to compare scores in default Lucene
> scoring, even *with* the queryNorm() factored in.
>
> I know you probably know this Mark, but it's important to make sure we're
> stating that in Lucene as is currently structured, scores can be *wildly*
> different between queries, even with queryNorm() factored in, for the
> sake of people reading this who haven't worked through the scoring in
> detail.
>
> -jake
>
> On Fri, Nov 20, 2009 at 2:24 PM, Mark Miller wrote:
>
> > Grant Ingersoll wrote:
> > >
> > > What I would like to get at is why anyone thinks scores are
> > > comparable across queries to begin with.
> >
> > They are somewhat comparable because we are using the approximate
> > cosine between the document/query vectors for the score - plus boosts
> > n stuff. How close the vectors are to each other. If q1 has a smaller
> > angle diff with d1 than q2 does with d2, then you can do a comparison.
> > It's just vector similarities. It's approximate because we fudge the
> > normalization. Why do you think the scores within a query search are
> > comparable? What's the difference when you try another query? The
> > query is the difference, and the query norm is what makes it more
> > comparable. It's just a different query vector with another query.
> > It's still going to just be a given "angle" from the doc vectors.
> > Closer is considered a better match. We don't do it to improve
> > anything, or because someone discovered something - it's just part of
> > the formula for calculating the cosine. It's the dot product formula.
> > You can lose it and keep the same relative rankings, but then you are
> > further from the cosine for the score - you start scaling by the
> > magnitude of the query vector. When you do that they are not so
> > comparable.
> >
> > If you take out the queryNorm, it's much less comparable. You are
> > effectively multiplying the cosine by the magnitude of the query
> > vector - so different queries will scale the score differently - and
> > not in a helpful way - a term vector and query vector can have very
> > different magnitudes, but very similar term distributions. That's why
> > we are using the cosine rather than Euclidean distance in the first
> > place. Pretty sure it's more linear algebra than IR - or the vector
> > stuff from calc 3 (or wherever else different schools put it).

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org