On Fri, Nov 20, 2009 at 2:50 PM, Mark Miller <markrmiller@gmail.com> wrote:
> Jake Mannix wrote:
> > Remember: we're not really doing cosine at all here.
> This, I think, is fuzzy right? It seems to be common to still call this
> cosine scoring loosely  pretty much every practical impl fudges things
> somewhat when doing the normalization (though we are on the heavy side
> of fudgers)  I think its pretty rare to do the true cosine because its
> so expensive. It can be somewhat misleading though.
>
Cosine isn't expensive  it's just *impossible* to do when you're doing
incremental indexing: to normalize to actually do cosine similarity requires
knowing the idf at *index time*, so that the cosine norm (the norm Lucene
does
use on the query side) of fields  1/sqrt(sum_{i}(tf_(term_i)^2 *
idf(term_i)^2)) 
is impossible to calculate.
Incremental indexing has it's costs!
In cases where you *do* have either the exact idf values for your corpus's
termvocabulary (or a good way to approximate it, since it only varies
logarithmically, this isn't too hard), then if you override lengthNorm to
always
return 1, and then take your document's Fields, and boost() them by the
factor
I listed above, then you can get exact (well, as exact as you can with only
onebyte fieldnorms :\ [are we really that starved for RAM that we have to
have that little precision?] ) cosine similarity, at really no additional
*querytime*
performance hit. Just a little math done at indexing time.
Have you looked at the Similarity scoring explanation page that was
> recently improved? Have any suggestions on changes to it? Doron put a
> fair amount of work into improving it recently, but I think it could
> always get better. Its currently leaning towards presenting this as
> cosine  that seems in line with the few text books I've seen, but I'm
> admittedly not that deep into any of this.
>
Yeah, that page is a lot better than it used to be, but I'd really try not
to
call this cosine  when people think of cosines, they think of:
a) angles,
b) numbers which are less than one, and
c) when they're equal to 1, you have a perfect match.
We have a)  kinda. We do not have b) or c), and we don't have scores
which are really comparable, and maybe Grant's question should be
answered with the statement: they shouldn't be:
If I query the google with "web" (getting a little less than three billion
hits,
so it maybe has an idf of 3 or 4), and find a document with 10 terms, one
of which is "web" (and the others are equally common in terms of idf),
then lucene says we should score that as:
idf(web) / sqrt(10) =~ 0.9,
while cosine would say:
idf(web) / sqrt(1^2 * idf(web)^2 + tf_common_term2^2 * idf(term2)^2 + ...
)
=~ 1/sqrt(10) = 0.31
If I instead query with "floccinaucinilipilification" (which comes up with
about 45K hits, so maybe has an idf of 20), and find a document with
10 terms, one of which is floccinaucinilipilification, and the other terms
are also similarly rare (also idf 20), lucene says the score is
idf(floccinaucinilipilification) / sqrt(10), = 6.5
while cosine would say:
idf(floccinaucinilipilification) / sqrt(1^2 idf(flocci...)^2 +
tf*rareTerm2^2 + ... )
=~ 1/sqrt(10) = 0.31
What is nice about *real* cosine is that it captures a sense of absolute
scoring  the hit on the really rare word in a document which does not
have a lot of other rare words (and is in general pretty short) is indeed
measured by the fact that the score is close to 1.0 (perfect match),
whereas the Lucene default scoring does not give a good measure 
web scores close to 1.0 on its hit, and you only notice that this isn't as
good as the rare word hit because that one scores way *higher* than 1.0.
On the other hand, the thing which Lucene scoring does which cosine
does *not* is measure the fact that hitting the rare word is *way* better
than hit on a common word, from a user perspective, so giving them
equal scores because their cosines are the same is maybe not the
right thing to do (and so "yay Doug!", heh).
Have you looked at the Similarity scoring explanation page that was
recently improved? Have any suggestions on changes to it? Doron put a
fair amount of work into improving it recently, but I think it could
always get better. Its currently leaning towards presenting this as
cosine  that seems in line with the few text books I've seen, but I'm
admittedly not that deep into any of this.
I think it's way better than it used to be (although it's even more dense,
which
I guess is required, given that it's a dense subject), but I really don't
think
we want to let people think it's a cosine  it's "like" a cosine, but with a
rather different normalization. Doron's got it stated like that, already,
but
it needs to be crystal clear that scores are sometimes higher than 1, and in
fact the "best" score for one query may be rather different than the "best"
for another query.
jake
