lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-2187) improve lucene's similarity algorithm defaults
Date Thu, 09 May 2013 23:06:13 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2187:
----------------------------------

    Fix Version/s:     (was: 4.3)
                   4.4
    
> improve lucene's similarity algorithm defaults
> ----------------------------------------------
>
>                 Key: LUCENE-2187
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2187
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/query/scoring
>            Reporter: Robert Muir
>             Fix For: 4.4
>
>         Attachments: LUCENE-2187.patch, scoring.pdf, scoring.pdf, scoring.pdf
>
>
> First things first: I am not an IR guy. The goal of this issue is to make 'surgical'
> tweaks to Lucene's scoring formula to bring its relevance performance up to that of more
> modern algorithms such as BM25.
> In my opinion, the concept of having some 'flexible' scoring with good speed across the
> board is an interesting goal, but not practical in the short term.
> Instead, I propose incorporating some work similar to lnu.ltc and friends, but slightly
> different. This seems to be in line with the paper published earlier about the TREC
> Million Query Track...
> Here is what I propose in pseudocode (overriding DefaultSimilarity):
> {code}
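>   // Assumed fields, not in the original pseudocode: slope is the constant
>   // described below, pivot is the average field length.
>   private float slope = 0.25f;
>   private float pivot;
>
>   // Logarithmic term-frequency weight; guard freq <= 0 so log(0) is never taken.
>   @Override
>   public float tf(float freq) {
>     return freq > 0 ? 1 + (float) Math.log(freq) : 0;
>   }
>   
>   // Pivoted length normalization: fields longer than the pivot are penalized.
>   @Override
>   public float lengthNorm(String fieldName, int numTerms) {
>     return (float) (1 / ((1 - slope) * pivot + slope * numTerms));
>   }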
> {code}
> Here slope is a constant (I used 0.25 for all relevance evaluations: the goal is to
> have a better default), and pivot is the average field length. Obviously we shouldn't make
> the user provide the pivot; the system should compute it.
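For anyone who wants to try the two formulas outside Lucene, here is a minimal standalone sketch. It does not extend DefaultSimilarity. The class name, the averageLength helper, and the sample field lengths are all made up for illustration, and the zero guard in tf is an added assumption rather than part of the proposal.

```java
import java.util.Locale;

/** Standalone sketch of the two proposed formulas; not tied to any Lucene class. */
final class PivotedSimilaritySketch {
    private final float slope; // 0.25 in the evaluations described in the issue
    private final float pivot; // average field length, in terms

    PivotedSimilaritySketch(float slope, float pivot) {
        this.slope = slope;
        this.pivot = pivot;
    }

    /** Hypothetical helper: average field length over a non-empty sample. */
    static float averageLength(int[] fieldLengths) {
        long total = 0;
        for (int len : fieldLengths) {
            total += len;
        }
        return (float) total / fieldLengths.length;
    }

    /** Logarithmic tf; the zero guard is an assumption so log(0) is never taken. */
    float tf(float freq) {
        return freq > 0 ? 1 + (float) Math.log(freq) : 0f;
    }

    /** Pivoted length normalization with the pivot at the average field length. */
    float lengthNorm(int numTerms) {
        return 1f / ((1 - slope) * pivot + slope * numTerms);
    }

    public static void main(String[] args) {
        int[] lengths = {50, 100, 150}; // made-up field lengths
        PivotedSimilaritySketch sim =
            new PivotedSimilaritySketch(0.25f, averageLength(lengths));
        for (int len : lengths) {
            System.out.printf(Locale.ROOT, "numTerms=%d lengthNorm=%.5f%n",
                len, sim.lengthNorm(len));
        }
    }
}
```

With the pivot at the average length, a field of exactly average length gets a norm of 1 / pivot, and the slope controls how quickly longer fields are penalized relative to that.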
> These two pieces do not improve Lucene much independently, but together they are competitive
> with BM25 scoring on the test collections I have run so far.
> The idea here is that this logarithmic tf normalization is independent of the tf / mean
> TF that you see in some of these algorithms. In fact, I implemented lnu.ltc with cosine pivoted
> length normalization and the log(tf)/log(mean TF) normalization, and it did not fare as well as
> this method. This method is also simpler: we do not need to calculate the mean TF at all.
> The BM25-like "binary" pivot here works better on the test collections I have run, but
> of course only with the tf modification.
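One possible reading of "BM25-like": the denominator of the proposed lengthNorm has the same shape as the length factor in the standard Okapi BM25 formula, (1 - b) + b * dl / avgdl, once that factor is multiplied by avgdl, with b in the role of slope and avgdl in the role of pivot. The sketch below only checks that arithmetic. The class and method names are hypothetical, the BM25 factor is the textbook form rather than anything taken from this issue, and the numbers are made up.

```java
/** Compares the proposed lengthNorm denominator with the BM25 length factor. */
final class Bm25PivotComparison {
    /** Denominator of the proposed lengthNorm: (1 - slope) * pivot + slope * numTerms. */
    static double pivotedDenominator(double slope, double pivot, int numTerms) {
        return (1 - slope) * pivot + slope * numTerms;
    }

    /** Textbook Okapi BM25 length factor: (1 - b) + b * dl / avgdl. */
    static double bm25LengthFactor(double b, double avgdl, int dl) {
        return (1 - b) + b * dl / avgdl;
    }

    public static void main(String[] args) {
        double slope = 0.25;   // plays the role of BM25's b
        double avgLen = 120.0; // plays the role of both pivot and avgdl
        for (int len : new int[] {30, 120, 480}) {
            double pivoted = pivotedDenominator(slope, avgLen, len);
            double scaledBm25 = avgLen * bm25LengthFactor(slope, avgLen, len);
            // The two values agree up to floating-point rounding.
            System.out.println(len + ": " + pivoted + " vs " + scaledBm25);
        }
    }
}
```

The two quantities differ only by the constant factor avgdl, so they rank documents by length in the same way; how the normalization then combines with tf differs between BM25 and the proposal.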
> I am uploading a document with results from 3 test collections (Persian, Hindi, and Indonesian).
> I will test at least 3 more languages (yes, including English) across more collections
> and upload those results also, but I need to process these corpora to run the tests with the
> benchmark package, so this will take some time (maybe weeks).
> So, please rip it apart with scoring theory etc., but keep in mind that 2 of these 3 test
> collections are in the openrelevance svn, so if you think you have a great idea, don't
> hesitate to test it and upload results; this is what it is for.
> Also keep in mind, again, that I am not a scoring or IR guy; the only thing I can really bring
> to the table here is the willingness to do a lot of relevance testing!
