[ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753377#action_12753377
]
Mark Miller edited comment on LUCENE-1896 at 9/9/09 7:09 PM:

Okay - I think I was a tad off base.
Here is the cosine definition used:
{code}
cos(a) = V(q) dot V(d) / (|V(q)| |V(d)|)
{code}
So the cosine is the dot product of the query vector and the document vector, divided by
the product of the vectors' magnitudes. Classically, |V(q)||V(d)| is a normalization factor
that takes the vectors to unit vectors (so you get the real cosine):
{code}
cos(a) = v(q) dot v(d)
{code}
This is because the magnitude of a unit vector is 1 by definition (lowercase v is the unit
vector V / |V|).
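A minimal sketch of the two forms above, just to check that they agree. The vectors and helper names are made up for illustration and have nothing to do with Lucene's actual code:

```python
import math

def dot(a, b):
    # Dot product of two equal-length term-weight vectors.
    return sum(x * y for x, y in zip(a, b))

def magnitude(a):
    # Euclidean length |V| of a vector.
    return math.sqrt(dot(a, a))

def unit(a):
    # Scale a vector to length 1: v = V / |V|.
    m = magnitude(a)
    return [x / m for x in a]

def cosine_raw(q, d):
    # cos(a) = V(q) dot V(d) / (|V(q)| |V(d)|)
    return dot(q, d) / (magnitude(q) * magnitude(d))

def cosine_unit(q, d):
    # cos(a) = v(q) dot v(d), using unit vectors.
    return dot(unit(q), unit(d))

# Hypothetical term weights for a query and a document.
q = [1.0, 2.0, 0.0]
d = [2.0, 1.0, 2.0]

print(cosine_raw(q, d), cosine_unit(q, d))
```

Both calls print the same value (up to floating point rounding), since dividing by the magnitudes up front or at the end is the same operation.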
But we don't care about absolute numbers, just relative numbers (as has been often pointed
out) - so the IR guys already fudge this stuff.
While I thought before that the queryNorm corresponds to |V(q)||V(d)|, I was off - it's just
|V(q)|. |V(d)| is replaced with the document length normalization, a much faster calculation
with similar properties - a longer doc would most likely have a larger magnitude. *edit* Not
just similar properties, but often much better properties - the standard normalization would
not factor in document length at all - it essentially removes it.
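To illustrate the point about document length, here is a rough sketch. The `length_norm` below is a hypothetical stand-in of the form 1 / sqrt(number of terms), chosen only for illustration; it is an assumption, not a claim about what Lucene computes. The "long" doc is just the short doc repeated twice.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def magnitude(a):
    return math.sqrt(dot(a, a))

def cosine(q, d):
    # Full cosine: divides by both magnitudes.
    return dot(q, d) / (magnitude(q) * magnitude(d))

def length_norm(d):
    # Hypothetical length norm: 1 / sqrt(total number of term occurrences).
    return 1.0 / math.sqrt(sum(d))

def length_normalized_score(q, d):
    # Replace the division by |V(d)| with the cheaper length norm.
    return dot(q, d) * length_norm(d)

q = [1.0, 2.0, 0.0]
short_doc = [2.0, 1.0, 2.0]
long_doc = [2 * x for x in short_doc]  # same term proportions, twice the length

print(cosine(q, short_doc), cosine(q, long_doc))
print(length_normalized_score(q, short_doc), length_normalized_score(q, long_doc))
```

The cosine is identical for both docs, so document length is removed entirely, while the length-norm variant still lets length influence the score.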
So one strategy is just to not normalize the query - though the literature I see doing this
is very inefficiently calculating the query norm in the inner loop - we are not doing that,
and so it's not much of an optimization for us.
{code}
V(q) dot V(d) / |V(d)| = cos(a) * |V(q)| = V(q) dot v(d)
{code}
And skipping the query norm does make scores less comparable across queries (an odd goal I
know, but for free?) ;)
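A small sketch of both halves of that, assuming made-up documents and term weights: leaving out |V(q)| does not change the ranking for a single query, but the raw scores now scale with the query's magnitude, so they cannot be compared between queries.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def magnitude(a):
    return math.sqrt(dot(a, a))

def cosine(q, d):
    # Normalized by both the query and the document magnitude.
    return dot(q, d) / (magnitude(q) * magnitude(d))

def score_without_query_norm(q, d):
    # Only the document side is normalized: equals cos(a) * |V(q)|.
    return dot(q, d) / magnitude(d)

def rank(score_fn, q, docs):
    # Document names ordered from best to worst score.
    return sorted(docs, key=lambda name: score_fn(q, docs[name]), reverse=True)

# Hypothetical documents and query.
docs = {
    "d1": [2.0, 1.0, 2.0],
    "d2": [0.0, 3.0, 0.0],
    "d3": [1.0, 0.0, 0.0],
}
q = [1.0, 2.0, 0.0]
q_scaled = [10 * x for x in q]  # same direction, ten times the magnitude

print(rank(cosine, q, docs), rank(score_without_query_norm, q, docs))
print(score_without_query_norm(q, docs["d1"]),
      score_without_query_norm(q_scaled, docs["d1"]))
```

The two rankings match, since |V(q)| is a constant factor within one query. The second line shows the unnormalized score growing tenfold for the scaled query even though the cosine has not changed.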
Sorry I was a little off earlier - I just tried to learn all this myself - and linear algebra
was years ago - and open book tests lured my younger, more irresponsible self to not go to
the classes ...
Anyhow, that's my current understanding - please point out if you know I have something wrong.
was (Author: markrmiller@gmail.com):
Okay - I think I was a tad off base.
Here is the cosine definition used:
{code}
cos(a) = V(q) dot V(d) / (|V(q)| |V(d)|)
{code}
So the cosine is the dot product of the query vector and the document vector, divided by
the product of the vectors' magnitudes. Classically, |V(q)||V(d)| is a normalization factor
that takes the vectors to unit vectors (so you get the real cosine):
{code}
cos(a) = v(q) dot v(d)
{code}
This is because the magnitude of a unit vector is 1 by definition (lowercase v is the unit
vector V / |V|).
But we don't care about absolute numbers, just relative numbers (as has been often pointed
out) - so the IR guys already fudge this stuff.
While I thought before that the queryNorm corresponds to |V(q)||V(d)|, I was off - it's just
|V(q)|. |V(d)| is replaced with the document length normalization, a much faster calculation
with similar properties - a longer doc would most likely have a larger magnitude.
So one strategy is just to not normalize the query - though the literature I see doing this
is very inefficiently calculating the query norm in the inner loop - we are not doing that,
and so it's not much of an optimization for us.
{code}
V(q) dot V(d) / |V(d)| = cos(a) * |V(q)| = V(q) dot v(d)
{code}
And skipping the query norm does make scores less comparable across queries (an odd goal I
know, but for free?) ;)
Sorry I was a little off earlier - I just tried to learn all this myself - and linear algebra
was years ago - and open book tests lured my younger, more irresponsible self to not go to
the classes ...
Anyhow, that's my current understanding - please point out if you know I have something wrong.
> Modify confusing javadoc for queryNorm
> 
>
> Key: LUCENE-1896
> URL: https://issues.apache.org/jira/browse/LUCENE-1896
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Javadocs
> Reporter: Jiri Kuhn
> Priority: Minor
> Fix For: 2.9
>
>
> See http://markmail.org/message/arai6silfiktwcer
> The javadoc confuses me as well.
