lucene-dev mailing list archives

From "David Mark Nemeskey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3357) Unit and integration test cases for the new Similarities
Date Thu, 11 Aug 2011 12:03:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083080#comment-13083080 ]

David Mark Nemeskey commented on LUCENE-3357:
---------------------------------------------

Apparently the Dirichlet method returns a negative score if tf / docLen < corpusTf / corpusLen.
Unfortunately the negative score can be arbitrarily large in magnitude, so the fix is not as
easy as adding a constant to the score. A negative score of course makes sense if all documents
are scored: the function is monotone in tf, and consequently documents whose tf is 0 will
always be ranked lower than those that contain the word. But this is not how IR engines work,
since they score only the documents that match.
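
The sign condition above can be checked numerically. The following is a minimal sketch, assuming the standard Dirichlet-smoothed form score = log(1 + tf / (mu * p)) + log(mu / (docLen + mu)) with p = corpusTf / corpusLen. The class name DirichletSignSketch, the method signature and mu = 2000 are illustrative assumptions, not the code in the patch.

```java
/** Hypothetical stand-alone sketch; not the LMDirichletSimilarity from the patch. */
final class DirichletSignSketch {
    private DirichletSignSketch() {}

    /**
     * Assumed Dirichlet-smoothed score:
     * log(1 + tf / (mu * p)) + log(mu / (docLen + mu)), where p = corpusTf / corpusLen.
     */
    static double score(double tf, double docLen, double corpusTf, double corpusLen, double mu) {
        double p = corpusTf / corpusLen; // collection probability of the term
        return Math.log(1.0 + tf / (mu * p)) + Math.log(mu / (docLen + mu));
    }
}
```

With corpusTf / corpusLen = 10 / 10000 and mu = 2000, score(1, 100, 10, 10000, 2000) is positive because 1 / 100 exceeds the collection probability, while score(1, 5000, 10, 10000, 2000) is negative because 1 / 5000 is below it. Keeping tf fixed and growing docLen pushes the score further below zero without bound.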

Having said that, I believe that we could simulate such a system. I don't know exactly how
the query architecture works, but I presume the clauses that don't match a document are assigned
a zero value. Now instead of this zero, the Scorer (or whatever class does this) could ask
the Similarity for a default value. In this case LMDirichletSimilarity could return
score(stats, 0, Integer.MAX_VALUE), which is somewhere around -12.
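
The default-value idea can be illustrated with a small sketch. It assumes the same Dirichlet-smoothed form, score = log(1 + tf / (mu * p)) + log(mu / (docLen + mu)), and a plain sum over clause scores. Every name here (DefaultScoreSketch, defaultScore, combine, the NaN marker for a non-matching clause) is hypothetical and says nothing about how Lucene's Scorer actually combines clauses.

```java
/** Hypothetical sketch of the "default value" idea; none of these names exist in Lucene. */
final class DefaultScoreSketch {
    private DefaultScoreSketch() {}

    /** Assumed Dirichlet-smoothed score, with p the collection probability of the term. */
    static double score(double tf, double docLen, double p, double mu) {
        return Math.log(1.0 + tf / (mu * p)) + Math.log(mu / (docLen + mu));
    }

    /** Value proposed for a clause that does not match: tf = 0 in the longest possible document. */
    static double defaultScore(double p, double mu) {
        return score(0.0, Integer.MAX_VALUE, p, mu);
    }

    /** Sums clause scores; a NaN entry marks a clause that did not match the document. */
    static double combine(double[] clauseScores, double valueForNonMatching) {
        double sum = 0.0;
        for (double s : clauseScores) {
            sum += Double.isNaN(s) ? valueForNonMatching : s;
        }
        return sum;
    }
}
```

Take a document A that contains the term but scores below zero, and a document B that does not contain it. With 0 as the value for the non-matching clause, B outranks A. With defaultScore(p, mu) in its place, A outranks B, because under this formula no document of length up to Integer.MAX_VALUE can score lower than the default. The exact default depends on mu and on the formula in use.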

If we don't do this, we have three options:
1. add score(stats, 0, Integer.MAX_VALUE) to the score
2. if (score < 0) return 0
3. add corpusTf / corpusLen * docLen to tf

All three ensure a non-negative score, but each has its own disadvantage:
1. adds a pretty big constant to the score, which may not play well with the other parts of
the query
2. some documents that contain the term get the same 0 score as documents that don't (though
I cannot say this is not in line with the LM approach)
3. introduces a transformation that is difficult to characterize
For the time being, I'll go with 2, but we have to discuss this.
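
The three options can be compared side by side in a short sketch. It assumes the Dirichlet-smoothed form score = log(1 + tf / (mu * p)) + log(mu / (docLen + mu)), with p = corpusTf / corpusLen. Option 1 is read here as subtracting the (negative) value of score(0, Integer.MAX_VALUE), which is an interpretation. The class and method names are hypothetical.

```java
/** Hypothetical sketch of the three workarounds; not the code in the patch. */
final class NonNegativeOptionsSketch {
    private NonNegativeOptionsSketch() {}

    /** Assumed Dirichlet-smoothed score, with p the collection probability of the term. */
    static double score(double tf, double docLen, double p, double mu) {
        return Math.log(1.0 + tf / (mu * p)) + Math.log(mu / (docLen + mu));
    }

    /** Option 1: shift every score up by the magnitude of the lowest possible score. */
    static double shifted(double tf, double docLen, double p, double mu) {
        return score(tf, docLen, p, mu) - score(0.0, Integer.MAX_VALUE, p, mu);
    }

    /** Option 2: clamp negative scores to zero. */
    static double clamped(double tf, double docLen, double p, double mu) {
        return Math.max(0.0, score(tf, docLen, p, mu));
    }

    /** Option 3: add the expected term count p * docLen to tf before scoring. */
    static double boostedTf(double tf, double docLen, double p, double mu) {
        return score(tf + p * docLen, docLen, p, mu);
    }
}
```

For tf = 1, docLen = 5000, p = 0.001 and mu = 2000 the raw score is negative, while all three variants are non-negative. The sketch also shows the second disadvantage: clamped(1, 5000, ...) and clamped(0, 5000, ...) both return 0, so a matching document is indistinguishable from one with tf = 0. For option 3, tf = 0 gives a score of 0 up to floating-point rounding.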

> Unit and integration test cases for the new Similarities
> --------------------------------------------------------
>
>                 Key: LUCENE-3357
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3357
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/query/scoring
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>            Priority: Minor
>              Labels: gsoc, gsoc2011, test
>             Fix For: flexscoring branch
>
>         Attachments: LUCENE-3357.patch, LUCENE-3357.patch, LUCENE-3357.patch, LUCENE-3357.patch,
>                      LUCENE-3357.patch, LUCENE-3357.patch, LUCENE-3357.patch, LUCENE-3357.patch, LUCENE-3357.patch
>
>
> Write test cases to test the new Similarities added in [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220].
> Two types of test cases will be created:
>  * unit tests, in which mock statistics are provided to the Similarities and the score
>    is validated against hand calculations;
>  * integration tests, in which a small collection is indexed and then searched using
>    the Similarities.
> Performance tests will be performed in a separate issue.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

