[ https://issues.apache.org/jira/browse/LUCENE1896?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=12752533#action_12752533
]
Mark Miller edited comment on LUCENE-1896 at 9/8/09 7:21 AM:

bq. The bottom line, I think, is you shouldn't compare scores across queries. Often times,
you can't even compare scores for the same query if the underlying index changed. I also don't
understand Marvin's comment about "completeness in the implementation of cosine similarity"
nor the comment about scores being "closer together than farther apart".
It's not about being able to compare scores across queries per se.
The cosine similarity is calculated using the definition:
{code}
cos(a) = V(q) dot V(d) / (|V(q)| * |V(d)|)
{code}
queryNorm corresponds to the denominator of the right side of that equation. Suppose we take
it out. That's like multiplying cos(a) by the product of the magnitudes of the vectors:
{code}
|V(q)| * |V(d)| * cos(a) = V(q) dot V(d)
{code}
So rather than getting the cosine as the score, we get the cosine times a number that depends on the Euclidean
lengths of the query and doc vectors.
That's what he means by completeness of the implementation of the cosine similarity. Rather than getting a cosine
score, you get the cosine scaled by a somewhat arbitrary amount (one that depends mostly on the doc
vector).
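To make the scaling concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class name CosineSketch, its methods, and the two-term weight vectors are made up purely for illustration. It compares the raw dot product with the full cosine for two doc vectors that point in the same direction but differ in length.

```java
public class CosineSketch {
    // Dot product of two equal-length term-weight vectors.
    public static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Euclidean length (magnitude) of a vector.
    public static double norm(double[] v) {
        return Math.sqrt(dot(v, v));
    }

    // Full cosine similarity: the dot product divided by both magnitudes.
    public static double cosine(double[] q, double[] d) {
        return dot(q, d) / (norm(q) * norm(d));
    }

    public static void main(String[] args) {
        // Hypothetical weights, chosen only to keep the arithmetic easy to follow.
        double[] q = {1.0, 1.0};
        double[] d1 = {3.0, 4.0};
        double[] d2 = {6.0, 8.0}; // same direction as d1, twice the length

        System.out.println("dot(q, d1)    = " + dot(q, d1));
        System.out.println("dot(q, d2)    = " + dot(q, d2));
        System.out.println("cosine(q, d1) = " + cosine(q, d1));
        System.out.println("cosine(q, d2) = " + cosine(q, d2));
    }
}
```

With these made-up vectors the raw dot product doubles (7 versus 14) simply because d2 is longer, while the cosine is the same for both docs. That length-dependent factor is the skew the normalization removes.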
Getting rid of that skew at essentially no cost makes a lot of sense to me, which jibes
with why the IR literature and Doug keep it around. It's not there to make scores between queries comparable,
but it makes them way more comparable than they would be otherwise, and it adds to the "completeness
in the implementation of cosine similarity" at what I am trusting is essentially no cost.
It's a keeper from my point of view. It's not based on research, it's just the math of the formula,
and if it had any real expense, it would likely have been tossed long ago (in the IR world).
> Modify confusing javadoc for queryNorm
> 
>
> Key: LUCENE-1896
> URL: https://issues.apache.org/jira/browse/LUCENE1896
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Javadocs
> Reporter: Jiri Kuhn
> Priority: Minor
> Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.

