lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect
Date Sat, 12 Sep 2009 17:48:57 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754571#action_12754571 ]

Doron Cohen commented on LUCENE-1908:
-------------------------------------

Mark and Shai, thanks for reviewing!

Mark, I think you have a point here (and I am definitely no more an IR guy than you are :)).

Truth is I was surprised to find out (through your comments in LUCENE-1896) that this component
of the score is "missing", and I indeed thought that the "right thing to do" (if there is
such a thing as "right") really is to do both: normalize to the unit vector, and then normalize
by length to compensate for the "unfair" advantage of long documents.

But you're right: the way I presented V(d) normalization and doc-length normalization
is incorrect. It reads as if doing both were the right thing to do, which does not
do justice to Lucene. I will change the writing.

Interestingly, for a document containing N distinct terms, the 1/Euclidean-norm and Lucene's
default similarity's length norm are the same: 1/sqrt(N). But if you double that doc to have
two occurrences of each of the N distinct terms, its length would be 2N, 1/Euclidean-norm
would be 1/sqrt(4N) but Lucene's default similarity's length norm would be 1/sqrt(2N). So
we will punish documents with duplicate terms less than would the Euclidean norm...  
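
To make the arithmetic above concrete, here is a minimal sketch in plain Python (not Lucene code). It assumes raw term frequencies as vector weights (no idf), and it ignores boosts and the lossy encoding Lucene applies when storing norms. The function names are hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter

def euclidean_norm_factor(tokens):
    """1 / Euclidean norm of the raw term-frequency vector (assumption: tf weights only)."""
    tf = Counter(tokens)
    return 1.0 / math.sqrt(sum(f * f for f in tf.values()))

def length_norm_factor(tokens):
    """1 / sqrt(number of tokens), the form of the default length norm discussed above."""
    return 1.0 / math.sqrt(len(tokens))

N = 4
distinct = [f"t{i}" for i in range(N)]  # N distinct terms, one occurrence each
doubled = distinct * 2                  # the same N terms, two occurrences each

for name, doc in (("distinct", distinct), ("doubled", doubled)):
    print(name, euclidean_norm_factor(doc), length_norm_factor(doc))
```

With N = 4 the two factors agree for the document of distinct terms (both are 1/sqrt(N)). For the doubled document the Euclidean factor is 1/sqrt(4N) while the length norm is 1/sqrt(2N), so the length norm penalizes the duplicated terms less.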

I am not aware of an evaluation or discussion of why this approach was selected,
so I assumed (tentatively) that it was chosen merely for performance reasons. You said
in LUCENE-1896:
bq. not just similar properties - but many times better properties - the standard normalization
would not factor in document length at all - it essentially removes it.
Is it really better? It seems to "punish" the same for length due to distinct terms, and to
punish less for length due to duplicate terms. Is this really a desired behavior? My intuition
says no, but I am not sure.

Anyhow, this issue is more about describing what Lucene is doing today than about what Lucene
should do, so I think I have the correct picture now (except for the historical justification,
which is interesting but not a show stopper).

Shai, thanks for the fixes.

(updated patch to follow).

> Similarity javadocs for scoring function to relate more tightly to scoring models in effect
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1908
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1908
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1908.patch
>
>
> See discussion in the related issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

