lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene
Date Mon, 30 Nov 2009 04:45:20 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783532#action_12783532 ]

Robert Muir edited comment on LUCENE-2091 at 11/30/09 4:45 AM:
---------------------------------------------------------------

otis, attached is a graph i produced from the hamshahri corpus, comparing 4 different combinations:
Lucene SimpleAnalyzer
Lucene SimpleAnalyzer + BM25
Lucene PersianAnalyzer
Lucene PersianAnalyzer + BM25

the hamshahri corpus uses a standardized encoding of persian (i.e. the normalization filter
is a no-op),
so any analyzer gain is strictly due to "stopwords", although in persian i wouldn't call some
of these entries words.

this was mostly to show that the analyzer is actually useful, i.e. the scoring system can't
completely make up for lack of support like this.

btw, you can play around with openrelevance svn and duplicate my experiments on this same
corpus yourself if you want. there's an indonesian corpus there too. i've also tested hindi
with this impl.
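
For readers unfamiliar with what "gain strictly due to stopwords" means mechanically, here is a minimal sketch in Python (not Lucene code). It assumes plain whitespace tokenization as a rough stand-in for a simple analyzer, and the three-word stopword set is a hypothetical example, not the list that PersianAnalyzer actually ships.

```python
# Hypothetical stopword set for illustration only; this is NOT
# the list that Lucene's PersianAnalyzer uses.
EXAMPLE_STOPWORDS = {"و", "در", "از"}  # roughly: "and", "in", "from"


def plain_tokens(text):
    """Whitespace tokenization, a rough stand-in for a simple analyzer."""
    return text.split()


def filtered_tokens(text, stopwords=EXAMPLE_STOPWORDS):
    """Same tokenization, but drops stopwords before indexing or querying."""
    return [tok for tok in plain_tokens(text) if tok not in stopwords]
```

With the normalization step being a no-op on this corpus, the only difference between the two pipelines is which tokens survive to be indexed and matched.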


> Add BM25 Scoring to Lucene
> --------------------------
>
>                 Key: LUCENE-2091
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2091
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Yuval Feinstein
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of Okapi-BM25 scoring
> in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime somewhat.
> I would like to contribute the code to Lucene under contrib. 
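
As background on what Okapi BM25 computes, the following is a minimal sketch of the textbook formula in Python. It is not the code from the linked implementation or from the patch on this issue. The function name, the argument names, and the defaults k1=1.2 and b=0.75 are assumptions chosen for illustration; the idf term is the classic Okapi form, which can go negative for terms that occur in more than half of the documents.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one document for a bag-of-words query.

    doc_freqs maps a term to the number of documents containing it.
    k1 and b are commonly quoted defaults, assumed here for illustration.
    """
    term_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        f = term_counts.get(term, 0)   # term frequency in this document
        n = doc_freqs.get(term, 0)     # number of documents with the term
        if f == 0 or n == 0:
            continue                   # term contributes nothing
        # Classic Okapi idf; negative when n > num_docs / 2.
        idf = math.log((num_docs - n + 0.5) / (n + 0.5))
        # Length normalization: longer-than-average documents are penalized.
        length_norm = 1.0 - b + b * doc_len / avg_doc_len
        score += idf * f * (k1 + 1.0) / (f + k1 * length_norm)
    return score
```

The term-frequency part saturates: a second occurrence of a term raises the score by less than the first one did, and k1 controls how quickly that happens.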


