lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene
Date Tue, 16 Feb 2010 20:09:28 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834421#action_12834421 ]

Robert Muir edited comment on LUCENE-2091 at 2/16/10 8:09 PM:
--------------------------------------------------------------

Joaquin, have you seen this paper: http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf

It's interesting how they modified BM25's idf formula slightly to improve results
when no stopword list is used. I'm curious what you think about this, as it looks like a potential
improvement for people not using stopwords (multilingual situations, etc.).

Edit: for simplicity, here is the quote:

{noformat}
Using the original idf formula
idf = log[(n - df_j + 0.5)/(df_j + 0.5)], we have noticed
that when the underlying term t_j occurs in more than half of
the documents (df_j > n/2), the resulting idf value would be
negative, and the final document score also could be negative.
As a means of estimating idf, we therefore suggest a new variant
defined as idf = log{1 + [(n - df_j + 0.5)/(df_j + 0.5)]}.
{noformat}
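
To make the difference concrete, here is a minimal Python sketch of the two idf formulas from the quote. It is only an illustration, not code from the attached patch or from the paper: the function names and the collection size of 1000 documents are made up, and n and df stand for the collection size and the document frequency of the term.

```python
import math


def idf_original(n: int, df: int) -> float:
    """Original BM25 idf from the quote: log[(n - df + 0.5) / (df + 0.5)]."""
    return math.log((n - df + 0.5) / (df + 0.5))


def idf_variant(n: int, df: int) -> float:
    """Variant from the quote: log{1 + [(n - df + 0.5) / (df + 0.5)]}."""
    return math.log(1.0 + (n - df + 0.5) / (df + 0.5))


if __name__ == "__main__":
    n = 1000  # hypothetical collection size, chosen only for illustration
    # Document frequencies below, at, and above half of the collection.
    for df in (10, 500, 501, 900):
        print(f"df={df:4d}  original={idf_original(n, df):+.4f}  "
              f"variant={idf_variant(n, df):+.4f}")
```

With these definitions the original formula goes negative once df exceeds n/2, which is the stopword-like case the quote describes, while the variant stays positive for any df between 0 and n because the argument of the logarithm is always greater than 1.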



  
> Add BM25 Scoring to Lucene
> --------------------------
>
>                 Key: LUCENE-2091
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2091
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Yuval Feinstein
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2091.patch, persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime somewhat.
> I would like to contribute the code to Lucene under contrib. 

