Jake Mannix wrote:
>
>
> On Fri, Nov 20, 2009 at 2:50 PM, Mark Miller <markrmiller@gmail.com
> <mailto:markrmiller@gmail.com>> wrote:
>
> Jake Mannix wrote:
> > Remember: we're not really doing cosine at all here.
> This, I think, is fuzzy right? It seems to be common to still call
> this
> cosine scoring loosely  pretty much every practical impl fudges
> things
> somewhat when doing the normalization (though we are on the heavy side
> of fudgers)  I think its pretty rare to do the true cosine
> because its
> so expensive. It can be somewhat misleading though.
>
>
> Cosine isn't expensive  it's just *impossible* to do when you're doing
> incremental indexing: to normalize to actually do cosine similarity
> requires
> knowing the idf at *index time*, so that the cosine norm
But cosine has two norms? The query norm and the document norm  taking
the two vectors to the unit space  it looks expensive to me to do both
of them properly. The IR lit I've seen fudges them down to Root(L) even
in the non incremental case for the document vector normalization.
> (the norm Lucene does
> use on the query side) of fields  1/sqrt(sum_{i}(tf_(term_i)^2 *
> idf(term_i)^2)) 
> is impossible to calculate.
But isn't that *not* the query side norm? Looks like the document vector
norm to me.
>
> Incremental indexing has it's costs!
>
> In cases where you *do* have either the exact idf values for your corpus's
> termvocabulary (or a good way to approximate it, since it only varies
> logarithmically, this isn't too hard), then if you override lengthNorm
> to always
> return 1, and then take your document's Fields, and boost() them by
> the factor
> I listed above, then you can get exact (well, as exact as you can with
> only
> onebyte fieldnorms :\ [are we really that starved for RAM that we
> have to
> have that little precision?] ) cosine similarity, at really no
> additional *querytime*
> performance hit. Just a little math done at indexing time.
Where are you doing the correct document vector normalization to get the
true cosine? I'm not following. It looks like the above formula is the
doc vector norm  so what about the query vector norm then?
>
> Have you looked at the Similarity scoring explanation page that was
> recently improved? Have any suggestions on changes to it? Doron put a
> fair amount of work into improving it recently, but I think it could
> always get better. Its currently leaning towards presenting this as
> cosine  that seems in line with the few text books I've seen, but I'm
> admittedly not that deep into any of this.
>
>
> Yeah, that page is a lot better than it used to be, but I'd really try
> not to
> call this cosine  when people think of cosines, they think of:
>
> a) angles,
> b) numbers which are less than one, and
> c) when they're equal to 1, you have a perfect match.
When they think of cosine yes  but IR lit seems to fudge this  non of
the common document length normalizations are real cosine  they are
just square root variations  with Lucene's being the most simple in the
family.
>
> We have a)  kinda. We do not have b) or c), and we don't have scores
> which are really comparable, and maybe Grant's question should be
> answered with the statement: they shouldn't be:
Right  we have always said they are not really comparable. But there
are degrees of that. I think they have a degree of comparability  they
must if we can make them less comparable, and we can  drop queryNorm.
>
> If I query the google with "web" (getting a little less than three
> billion hits,
> so it maybe has an idf of 3 or 4), and find a document with 10 terms, one
> of which is "web" (and the others are equally common in terms of idf),
> then lucene says we should score that as:
>
> idf(web) / sqrt(10) =~ 0.9,
>
> while cosine would say:
>
> idf(web) / sqrt(1^2 * idf(web)^2 + tf_common_term2^2 * idf(term2)^2
> + ... )
>
> =~ 1/sqrt(10) = 0.31
>
> If I instead query with "floccinaucinilipilification" (which comes up
> with
> about 45K hits, so maybe has an idf of 20), and find a document with
> 10 terms, one of which is floccinaucinilipilification, and the other terms
> are also similarly rare (also idf 20), lucene says the score is
>
> idf(floccinaucinilipilification) / sqrt(10), = 6.5
>
> while cosine would say:
>
> idf(floccinaucinilipilification) / sqrt(1^2 idf(flocci...)^2 +
> tf*rareTerm2^2 + ... )
>
> =~ 1/sqrt(10) = 0.31
>
> What is nice about *real* cosine is that it captures a sense of absolute
> scoring  the hit on the really rare word in a document which does not
> have a lot of other rare words (and is in general pretty short) is indeed
> measured by the fact that the score is close to 1.0 (perfect match),
> whereas the Lucene default scoring does not give a good measure 
> web scores close to 1.0 on its hit, and you only notice that this
> isn't as
> good as the rare word hit because that one scores way *higher* than 1.0.
>
> On the other hand, the thing which Lucene scoring does which cosine
> does *not* is measure the fact that hitting the rare word is *way* better
> than hit on a common word, from a user perspective, so giving them
> equal scores because their cosines are the same is maybe not the
> right thing to do (and so "yay Doug!", heh).
>
> Have you looked at the Similarity scoring explanation page that was
> recently improved? Have any suggestions on changes to it? Doron put a
> fair amount of work into improving it recently, but I think it could
> always get better. Its currently leaning towards presenting this as
> cosine  that seems in line with the few text books I've seen, but I'm
> admittedly not that deep into any of this.
>
>
> I think it's way better than it used to be (although it's even more
> dense, which
> I guess is required, given that it's a dense subject), but I really
> don't think
> we want to let people think it's a cosine  it's "like" a cosine, but
> with a
> rather different normalization.
We can amp that  but I like I said, I think thats standard practice 
in the IR books I've seen, they say, here is the cosine formula for term
vector scoring, and then they go on to fudge in a manner very similar to
how Lucene does it.
> Doron's got it stated like that, already, but
> it needs to be crystal clear that scores are sometimes higher than 1,
> and in
> fact the "best" score for one query may be rather different than the
> "best"
> for another query.
I think this is general knowledge with Lucene, but that may be a stretch
 perhaps its not as wide know as I now assume for new users. Hits used
to normalize to a 01 score actually, but its no longer around.
>
> jake

To unsubscribe, email: javadevunsubscribe@lucene.apache.org
For additional commands, email: javadevhelp@lucene.apache.org
