Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <4B071AD4.8040300@gmail.com>
Date: Fri, 20 Nov 2009 17:40:20 -0500
From: Mark Miller <markrmiller@gmail.com>
User-Agent: Thunderbird 2.0.0.23 (X11/20090817)
MIME-Version: 1.0
To: java-dev@lucene.apache.org
Subject: Re: Whither Query Norm?
References: <FB396ABA-677E-4F56-B56C-05E390EBD222@apache.org>
	 <4b124c310911200815g5e2cb2cay22752a658c5dcc09@mail.gmail.com>
	 <4b124c310911200819v21bb443i29b26d88c0b3c5d@mail.gmail.com>
	 <50157FE2-D845-49F2-93E8-DB5B103DB059@apache.org>
	 <4b124c310911201024v3406e22s4f246ecffd3461dc@mail.gmail.com>
	 <C77DFFD1-149C-4A9B-95E6-98DF60245085@apache.org>
	 <4B07171C.1050404@gmail.com>
 <4b124c310911201431k28da961bse2c2755fa7b26505@mail.gmail.com>
In-Reply-To: <4b124c310911201431k28da961bse2c2755fa7b26505@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Yes, it's a good point. I'm coming at it from a more pure angle. And I'm
not so elegant in my thought patterns :)

Right though - our document vector normalization is - uh - quick and
dirty :) It's about the cheapest one I've seen other than root(length).

I don't think that scores between queries are very comparable in general
in Lucene either - but they would be even less so if we dropped the
query norm. As I've argued in the past - if it had any real perf hit,
I'd be on the side of dropping it - but from what I can see, it really
doesn't, so I don't see why we should further skew the scores.
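
To make that concrete, here is a rough sketch in Python (not Lucene code). The
term weights and document vectors are made up, and query_norm() only mirrors
the shape of Lucene's 1/sqrt(sumOfSquaredWeights). It takes two queries with
the same term distribution but different magnitudes and scores them with and
without the norm:

```python
import math

def dot(q, d):
    """Dot product of two sparse term -> weight vectors."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def query_norm(q):
    """1 / |q|, the same shape as 1/sqrt(sumOfSquaredWeights)."""
    return 1.0 / math.sqrt(sum(w * w for w in q.values()))

def score(q, d, use_query_norm=True):
    s = dot(q, d)
    return s * query_norm(q) if use_query_norm else s

# Hypothetical document vectors, already length-normalized.
docs = {
    "d1": {"lucene": 0.8, "score": 0.6},
    "d2": {"lucene": 0.6, "score": 0.8},
}

# Same term distribution, different magnitude: q_big is 10 * q_small.
q_small = {"lucene": 2.0, "score": 1.0}
q_big = {"lucene": 20.0, "score": 10.0}

def ranking(q, use_query_norm):
    """Document names ordered best-first for query q."""
    return sorted(docs, key=lambda n: score(q, docs[n], use_query_norm),
                  reverse=True)

with_norm_small = score(q_small, docs["d1"])
with_norm_big = score(q_big, docs["d1"])
raw_small = score(q_small, docs["d1"], use_query_norm=False)
raw_big = score(q_big, docs["d1"], use_query_norm=False)

print(with_norm_small, with_norm_big)  # equal: the norm removes |q|
print(raw_small, raw_big)              # raw_big is 10x raw_small
print(ranking(q_small, True), ranking(q_small, False))  # same order
```

The ranking inside one query doesn't change either way. Only the scale of the
scores does, and without the norm that scale is whatever the query's
magnitude happens to be.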

Jake Mannix wrote:
> Remember: we're not really doing cosine at all here.  The factor of IDF^2
> on the top, with the factor of 1/sqrt(numTermsInDocument) on the bottom
> couples together to end up with the following effect:
>
>  q1 = "TERM1"
>  q2 = "TERM2"
>
> doc1 = "TERM1"
> doc2 = "TERM2"
>
> score(q1, doc1) = idf(TERM1)
> score(q2, doc2) = idf(TERM2)
>
> Both are perfect matches, but one scores higher (possibly much higher) than
> the other.
>
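
The example above can be checked numerically. This is a minimal Python sketch,
not Lucene code. It assumes the default factors as I understand them:
tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), a length norm of
1/sqrt(numTermsInDocument), and queryNorm = 1/sqrt(sumOfSquaredWeights). The
collection size and document frequencies are invented for illustration:

```python
import math

def idf(doc_freq, num_docs):
    # Assumed default idf formula; treat it as an illustration.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def single_term_score(doc_freq, num_docs, term_freq=1, doc_len=1):
    """Score of a one-term query against a document, boosts left at 1."""
    w = idf(doc_freq, num_docs)
    tf = math.sqrt(term_freq)
    length_norm = 1.0 / math.sqrt(doc_len)
    query_norm = 1.0 / math.sqrt(w * w)  # one term, so just 1/idf
    # idf appears twice: once in the query weight, once in the doc weight.
    return query_norm * tf * w * w * length_norm

NUM_DOCS = 1000   # hypothetical collection size
RARE_DF = 1       # hypothetical docFreq of TERM1
COMMON_DF = 500   # hypothetical docFreq of TERM2

score_q1_doc1 = single_term_score(RARE_DF, NUM_DOCS)
score_q2_doc2 = single_term_score(COMMON_DF, NUM_DOCS)

print(score_q1_doc1, idf(RARE_DF, NUM_DOCS))
print(score_q2_doc2, idf(COMMON_DF, NUM_DOCS))
```

Under those assumptions each score collapses to the idf of its term, so the
rare-term match scores several times higher even though both documents match
their queries exactly.
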
> Boosts work just fine with cosine (it's just a way of putting "tf" into the
> query side as well as in the document side), but normalizing documents
> without taking the idf of terms in the document into consideration blows
> away the ability to compare scores in default Lucene scoring, even *with*
> the queryNorm() factored in.
>
> I know you probably know this Mark, but it's important to make sure we're
> stating that in Lucene as is currently structured, scores can be *wildly*
> different between queries, even with queryNorm() factored in, for the sake
> of people reading this who haven't worked through the scoring in detail.
>
>   -jake
>  
>
> On Fri, Nov 20, 2009 at 2:24 PM, Mark Miller <markrmiller@gmail.com> wrote:
>
>     Grant Ingersoll wrote:
>     >
>     >  What I would like to get at is why anyone thinks scores are
>     > comparable across queries to begin with.
>     >
>     They are somewhat comparable because we are using the approximate cosine
>     between the document/query vectors for the score - plus boosts n stuff.
>     How close the vectors are to each other. If q1 has a smaller angle diff
>     with d1 than q2 does with d2, then you can do a comparison. It's just
>     vector similarities. It's approximate because we fudge the normalization.
>     Why do you think the scores within a query search are comparable? What's
>     the difference when you try another query? The query is the difference,
>     and the query norm is what makes it more comparable. It's just a
>     different query vector with another query. It's still going to just be a
>     given "angle" from the doc vectors. Closer is considered a better match.
>     We don't do it to improve anything, or because someone discovered
>     something - it's just part of the formula for calculating the cosine.
>     It's the dot product formula. You can lose it and keep the same relative
>     rankings, but then you are further from the cosine for the score - you
>     start scaling by the magnitude of the query vector. When you do that
>     they are not so comparable.
>
>     If you take out the queryNorm, it's much less comparable. You are
>     effectively multiplying the cosine by the magnitude of the query vector
>     - so different queries will scale the score differently - and not in a
>     helpful way - a term vector and query vector can have very different
>     magnitudes, but very similar term distributions. That's why we are using
>     the cosine rather than euclidean distance in the first place. Pretty
>     sure it's more linear algebra than IR - or the vector stuff from calc 3
>     (or wherever else different schools put it).
>
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>     For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
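
On the cosine vs euclidean point in the quoted text: a small Python sketch
with made-up weights, where the document vector is just the query vector
scaled by 10. The term distribution is the same and only the magnitude
differs:

```python
import math

def norm(v):
    """Euclidean length of a sparse term -> weight vector."""
    return math.sqrt(sum(w * w for w in v.values()))

def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    return dot / (norm(a) * norm(b))

def euclidean(a, b):
    """Straight-line distance between two sparse vectors."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

# Hypothetical weights: doc is exactly 10 * query.
query = {"query": 3.0, "norm": 4.0}
doc = {"query": 30.0, "norm": 40.0}

cos_sim = cosine(query, doc)
distance = euclidean(query, doc)

print(cos_sim)   # angle is zero, so the cosine is 1 (up to float rounding)
print(distance)  # the vectors are still far apart in raw distance
```

The cosine treats the two as a perfect match, while the euclidean distance
only reflects the difference in magnitude.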

