On Fri, Nov 20, 2009 at 4:09 PM, Mark Miller <markrmiller@gmail.com> wrote:
> But cosine has two norms? The query norm and the document norm  taking
> the two vectors to the unit space  it looks expensive to me to do both
> of them properly. The IR lit I've seen fudges them down to Root(L) even
> in the non incremental case for the document vector normalization.
>
Lucene has two norms too: lengthNorm (of the field), produced at index time,
and queryNorm. The expense of queryNorm is done once per query, not
once per document, and so as you yourself say, is not a performance
problem. The lengthNorm is factored into the document at index time, and
so is not a hit at *query time* at all
> > (the norm Lucene does
> > use on the query side) of fields  1/sqrt(sum_{i}(tf_(term_i)^2 *
> > idf(term_i)^2)) 
> > is impossible to calculate.
> But isn't that *not* the query side norm? Looks like the document vector
> norm to me.
>
This is exactly the query norm  1/sqrt(sumOfSquaredWeights), and
sum of squared weights is just sum of tf^2 * idf^2, except queries have
"boost" in place of "tf". The document vector norm that *Lucene* uses is
independent of idf of the terms in the document (it's 1/sqrt(doc_length)
for DefaultSimilarity).
> Where are you doing the correct document vector normalization to get the
> true cosine? I'm not following. It looks like the above formula is the
> doc vector norm  so what about the query vector norm then?
>
Ok, the technique I use to do this (I should really just refactor it to a
point
where I can post it for contrib/queries so that other people can use this
if it's not that commonly known), is to, at index time, calculate the true
document vector normalizations for the fields, and then feed it into their
boosts, and make sure that Similarity does *not* do field normalization
itself.
The formula for query vector normalization and document vector normalization
is *exactly* the same if you translate boosts of terms into termFrequency of
said term in the query "document". It's just applied to a different bag of
words.
> > a) angles,
> > b) numbers which are less than one, and
> > c) when they're equal to 1, you have a perfect match.
>
> When they think of cosine yes  but IR lit seems to fudge this  non of
> the common document length normalizations are real cosine  they are
> just square root variations  with Lucene's being the most simple in the
> family.
>
This is true, which is why I'm not really giving a better wording than what
Doron already wrote  I doubt I could do better. I'm concerned myself
because I've seen lots of questions over the years on the lists of people
asking about how to compare scores across queries, no matter how much
we tell them not to. And now here again we're telling them "they're not
comparable, but they're more comparable than if we didn't have this
queryNorm in there!". What exactly does that mean? I do search, and
I'm not really sure what it means, what does someone who is just learning
this stuff going to think?
>
> >
> > I think it's way better than it used to be (although it's even more
> > dense, which
> > I guess is required, given that it's a dense subject), but I really
> > don't think
> > we want to let people think it's a cosine  it's "like" a cosine, but
> > with a
> > rather different normalization.
>
> We can amp that  but I like I said, I think thats standard practice 
> in the IR books I've seen, they say, here is the cosine formula for term
> vector scoring, and then they go on to fudge in a manner very similar to
> how Lucene does it.
>
Well just because it's standard practice, doesn't mean it's a good idea.
I came to search from a math background, not CS, so when I see "vector
space" and "cosine", I maybe think different things than the average coder
learning IR. So maybe I'm not the right person to ask how such things
should be explained!
> > Doron's got it stated like that, already, but
> > it needs to be crystal clear that scores are sometimes higher than 1,
> > and in
> > fact the "best" score for one query may be rather different than the
> > "best"
> > for another query.
>
> I think this is general knowledge with Lucene, but that may be a stretch
>  perhaps its not as wide know as I now assume for new users. Hits used
> to normalize to a 01 score actually, but its no longer around.
>
It was definitely a good idea to get rid of normalizing by the top score,
but
what could replace it is normalizing by the "best possible score". This
tells
you how close to a perfect match you could have gotten. For cosine, this
would
just be the cosine of the angle from the document vector to the query
vector,
for Lucene, it would be something similar, but would have absolute meaning
(maybe... maybe not, because the theoretical best score is likely to be far
from the actual best score in a given corpus).
jake
