lucene-dev mailing list archives

From "Mark Miller (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect
Date Sat, 12 Sep 2009 19:13:58 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754577#action_12754577 ]

Mark Miller edited comment on LUCENE-1908 at 9/12/09 12:12 PM:
---------------------------------------------------------------

bq. Is it really better? It seems to "punish" the same for length due to distinct terms, and
to punish less for length due to duplicate terms. Is this really a desired behavior? My intuition
says no, but I am not sure.

It's only desired behavior if you have a corpus that favors it, but most do. I think you can
think of the |V(d)| normalization as taking out information about the document length - you start with an
m-dimensional vector, with each term given a weight depending on how many times it occurs - in other
words, there is information about the length of that document in there, and when you normalize
by |V(d)|, you take that information out - but a doc will appear more similar the more unique
terms it started with and the higher its tf's. So that method favors long docs, which will
naturally have more of both, though of course they aren't always actually more similar.

All of the normalizations I have seen are in the vein of |V(d)| - e.g. 1/sqrt(something).
All of the others also try to make up for this doc length problem by adjusting the curve
so that ultra-long docs are not favored too highly. Our default method is known to be not very
good - but it's also very fast and efficient compared to the better ones. Sweetspot is
much better and I think still efficient - but you need to tune the right params, I believe.
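
To show the shape I mean, here's a rough sketch of the default-style 1/sqrt(numTerms) norm next to a sweetspot-style plateau. This isn't the actual Similarity or SweetSpotSimilarity code, and the min/max/steepness values are just made-up tuning params - the point is only that the plateau stops punishing docs inside the "normal" length range:

{code:java}
// Sketch only, not the actual Similarity classes: comparing a default-style
// 1/sqrt(numTerms) length norm against a sweetspot-style plateau.
// The min/max/steepness values used below are invented, not any real defaults.
public class LengthNormSketch {

  // Default-style norm: monotonically punishes length, favors very short docs.
  static float defaultNorm(int numTerms) {
    return (float) (1.0 / Math.sqrt(numTerms));
  }

  // Sweetspot-style norm: flat (no punishment) inside [min, max],
  // then falls off outside that range with a tunable steepness.
  static float sweetSpotNorm(int numTerms, int min, int max, float steepness) {
    int outside = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
    return (float) (1.0 / Math.sqrt(steepness * outside + 1.0));
  }

  public static void main(String[] args) {
    for (int len : new int[] {10, 100, 500, 1000, 5000}) {
      System.out.printf("len=%5d  default=%.4f  sweetspot=%.4f%n",
          len, defaultNorm(len), sweetSpotNorm(len, 100, 1000, 0.5f));
    }
  }
}
{code}

Picking min/max to match the typical doc lengths of the corpus is the tuning part I mean.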

* edit *

I'm still a little confused, I guess :) What I meant is that longer docs will naturally have
larger weights, but larger weights actually hurt in the cosine normalization - so it actually
over-punishes, I guess? I don't know - all of this over-punishing/under-punishing is relative to
a relevance curve they work out (a probability of relevance as a function of document length),
and how the pivoted cosine curve compares against it. I'm just reading random interweb
PDFs from Google. Apparently our pivot also over-punishes large docs and over-favors small ones,
the same as this one, but perhaps not as badly? I'm seeing that in a Lucene/Juru research PDF.
This stuff is hard to grok on first pass.
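
For anyone following along, here's my rough reading of the pivoted normalization idea from those papers (Singhal et al.) - again not anything in Lucene, and the pivot/slope numbers are invented - the normalizer gets tilted around the pivot so docs shorter than the pivot are punished a bit more and longer docs a bit less than with plain cosine:

{code:java}
// Sketch of the pivoted length normalization idea (after Singhal/Buckley/Mitra),
// not anything in Lucene. The cosine norm |V(d)| is replaced by a line through
// the "pivot" point, pulling the retrieval curve closer to the observed
// relevance-vs-length curve: longer-than-average docs get punished less,
// shorter-than-average docs get punished more.
public class PivotedNormSketch {

  // pivot = average |V(d)| over the collection; slope in (0, 1] is a tuning knob.
  static double pivotedNorm(double cosineNorm, double pivot, double slope) {
    return (1.0 - slope) * pivot + slope * cosineNorm;
  }

  public static void main(String[] args) {
    double pivot = 10.0;  // assumed average norm for the collection
    double slope = 0.75;  // assumed tuning value
    for (double norm : new double[] {2.0, 10.0, 40.0}) {
      // short doc: 2.0 -> 4.0 (punished more), at the pivot: unchanged,
      // long doc: 40.0 -> 32.5 (punished less)
      System.out.printf("cosine norm=%5.1f  pivoted=%6.2f%n",
          norm, pivotedNorm(norm, pivot, slope));
    }
  }
}
{code}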

> Similarity javadocs for scoring function to relate more tightly to scoring models in effect
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1908
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1908
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1908.patch
>
>
> See discussion in the related issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

