lucene-dev mailing list archives

From "Mark Miller (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect
Date Sat, 12 Sep 2009 04:04:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754470#action_12754470 ]

Mark Miller commented on LUCENE-1908:
-------------------------------------

Looks great!

bq. Document Euclidean norm |V(d)| is excluded by Lucene, for indexing performance considerations (?).

Hmm - I'm not sure if that is right either. Are we not replacing the |V(d)| normalization
factor with our document length factor?

That's how it appears to me anyway -

For |V(d)| you have many options, right?

the cosine normalization - your standard Euclidean length - |V(d)|
or none (e.g. 1)
or pivoted normalized doc length
or SweetSpotSimilarity's formula
or the quick, dirty, easy, not-great doc length normalization that Lucene uses by default
or Okapi's formula
or ...
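
As a rough sketch of a few of the options listed above (Python for brevity; every function name here is hypothetical and none of this is Lucene API). The pivoted form follows the usual (1 - slope) * pivot + slope * |V(d)| shape, and the last function assumes the default length normalization is 1/sqrt(number of terms in the field), applied as a multiplier rather than a divisor:

```python
import math

def euclidean_norm(weights):
    """Cosine normalization factor: |V(d)|, the Euclidean length of the term-weight vector."""
    return math.sqrt(sum(w * w for w in weights))

def no_norm(weights):
    """No normalization at all: the factor is simply 1."""
    return 1.0

def pivoted_norm(weights, pivot, slope):
    """Pivoted normalization: blends a fixed pivot with the Euclidean length.
    pivot and slope are tuning parameters chosen by the caller."""
    return (1.0 - slope) * pivot + slope * euclidean_norm(weights)

def default_length_norm(num_terms):
    """Assumed shape of the default doc length factor: 1 / sqrt(num_terms).
    Unlike the factors above, this one multiplies the score."""
    return 1.0 / math.sqrt(num_terms)

weights = [3.0, 4.0]
print(euclidean_norm(weights))                      # divisor for cosine normalization
print(pivoted_norm(weights, pivot=10.0, slope=0.25))
print(default_length_norm(4))
```

The point is only that all of these occupy the same slot in the scoring formula, so picking one is a replacement of |V(d)| and not a removal.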

So we are replacing (which everyone generally does), not dropping, right?

And I don't think we are replacing for performance reasons (though it is complicated to calculate)
- we are replacing because it's not a great normalization factor.
Using |V(d)| eliminates info on the length of the original document - but longer documents will
still have higher tf's and more distinct terms - so it unnaturally gives them
an advantage (most long docs will be repeated pieces or cover multiple topics). So it's not
generally a good normalization factor, and we have chosen another?
(The one we have chosen isn't great either - long docs are punished too much and short ones
preferred too much.)
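
A small illustration of the "eliminates info on the length" point, as a sketch only (Python, raw term counts as weights, hypothetical helper name, not Lucene code): dividing by |V(d)| is scale-invariant, so a document made of the same piece repeated several times ends up with exactly the same normalized vector as the short original.

```python
import math
from collections import Counter

def cosine_normalized(tokens):
    """Return term -> tf / |V(d)| using raw term counts as the weights."""
    tf = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in tf.values()))
    return {term: count / norm for term, count in tf.items()}

short_doc = "lucene scoring lucene similarity".split()
long_doc = short_doc * 3  # the same piece repeated three times

a = cosine_normalized(short_doc)
b = cosine_normalized(long_doc)

# Every term gets the same weight in both documents, so the length
# difference is invisible after dividing by |V(d)|.
same = all(math.isclose(a[t], b[t]) for t in a)
print(same)
```

Under this normalization the repeated document is not penalized at all for being three times longer.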

Again, I'm not an IR guy, but that's my modest take.

> Similarity javadocs for scoring function to relate more tightly to scoring models in effect
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1908
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1908
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1908.patch
>
>
> See discussion in the related issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

