lucene-dev mailing list archives

From "Joaquin Perez-Iglesias (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene
Date Tue, 16 Feb 2010 20:47:28 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834455#action_12834455 ]

Joaquin Perez-Iglesias commented on LUCENE-2091:
------------------------------------------------

It is a consequence of the logarithm: when its argument is below 1 the result is negative, and
a negative score doesn't make much sense. As far as I know, this version of IDF is fairly
theoretical and is based on the binary independence retrieval model (BIR), which transforms the
product of probabilities into a summation of logarithms. Anyway, it is quite usual to add 1 to
the ratio before applying the logarithm to avoid situations like the previous one.
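
A minimal sketch of the difference, assuming the usual Okapi BM25 form of IDF,
log((N - n + 0.5) / (n + 0.5)), where N is the number of documents and n is the number of
documents containing the term. The class and method names are hypothetical and are not taken
from the patch.

```java
public class Bm25Idf {
    // Classic BM25 IDF (assumed form). The ratio drops below 1 once the term
    // appears in more than half of the documents, so the logarithm goes negative.
    public static double classicIdf(long docFreq, long numDocs) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Same ratio with 1 added before taking the logarithm. The ratio is always
    // positive, so the argument is always above 1 and the result is above 0.
    public static double plusOneIdf(long docFreq, long numDocs) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        long numDocs = 100;
        long docFreq = 60; // term present in 60% of the documents
        System.out.println("classic:  " + classicIdf(docFreq, numDocs)); // negative
        System.out.println("plus one: " + plusOneIdf(docFreq, numDocs)); // positive
    }
}
```

For a rare term (say 10 documents out of 100) both variants are positive, so the change only
matters for very common terms.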

In my opinion it should be added to the patch. It doesn't hurt but it helps :-)

This is clearly explained on Wikipedia: http://en.wikipedia.org/wiki/Okapi_BM25

A quote from Wikipedia:
{quote}
Please note that the above formula for IDF shows potentially major drawbacks when using it
for terms appearing in more than half of the corpus documents. These terms' IDF is negative,
so for any two almost-identical documents, one which contains the term and one which does
not contain it, the latter will possibly get a larger score. This means that terms appearing
in more than half of the corpus will provide negative contributions to the final document
score. This is often an undesirable behavior, so many real-world applications would deal with
this IDF formula in a different way:

    * Each summand can be given a floor of 0, to trim out common terms;
    * The IDF function *can be given a floor of a constant ε,* to avoid common terms being
ignored at all;
    * The IDF function can be replaced with a similarly shaped one which is non-negative,
or strictly positive to avoid terms being ignored at all.

{quote}
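
The first two options in the quote could be sketched roughly as follows. This is only an
illustration: the names are hypothetical, the IDF form log((N - n + 0.5) / (n + 0.5)) is assumed,
and the floor is applied to the IDF itself, which matches flooring each summand only if the term
frequency part of the summand is non-negative.

```java
public class IdfFloor {
    // Classic BM25 IDF (assumed form); negative for terms in more than half of the documents.
    public static double classicIdf(long docFreq, long numDocs) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Clamp the IDF from below. floor = 0 trims out common terms entirely;
    // a small positive floor (the epsilon in the quote) keeps them from being ignored.
    public static double flooredIdf(long docFreq, long numDocs, double floor) {
        return Math.max(floor, classicIdf(docFreq, numDocs));
    }

    public static void main(String[] args) {
        long numDocs = 100;
        long docFreq = 60; // common term, classic IDF is negative here
        System.out.println("floor 0:    " + flooredIdf(docFreq, numDocs, 0.0));
        System.out.println("floor 0.25: " + flooredIdf(docFreq, numDocs, 0.25));
    }
}
```

The value 0.25 is an arbitrary example, not a recommended setting.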

> Add BM25 Scoring to Lucene
> --------------------------
>
>                 Key: LUCENE-2091
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2091
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Yuval Feinstein
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2091.patch, persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of Okapi-BM25 scoring
> in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime somewhat.
> I would like to contribute the code to Lucene under contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
