lucene-java-user mailing list archives

From Yuval Feinstein <yuv...@answers.com>
Subject RE: BM25 Scoring Patch
Date Thu, 18 Feb 2010 16:20:29 GMT
-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Thursday, February 18, 2010 3:09 PM
To: java-user@lucene.apache.org
Subject: Re: BM25 Scoring Patch

Yuval, don't we still need this 'document-level IDF' for BM25f?

- Yes, we do need 'document-level IDF' for BM25f.  
- Joaquin tried to bypass this by using the IDF of the field having the longest average length
- instead of the document's IDF.
- This introduces some bias into the scoring formula, but maybe it is not too large...
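
To make the approximation above concrete, here is a minimal sketch, not code from Joaquin's patch. It assumes the classic BM25 IDF form log((N - df + 0.5) / (df + 0.5)) and that per-field average lengths and per-field document frequencies for the term are already available as maps. The class name, method names, and map inputs are all hypothetical and are not Lucene APIs.

```java
import java.util.Map;

// Hypothetical helper, not part of Lucene or of the BM25 patch.
final class LongestFieldIdf {

    // Classic BM25 IDF. docCount is N, docFreq is the number of documents
    // containing the term. The value is negative when docFreq > docCount / 2.
    static double bm25Idf(long docCount, long docFreq) {
        return Math.log((docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Returns the name of the field with the greatest average length.
    static String longestField(Map<String, Double> avgFieldLength) {
        if (avgFieldLength.isEmpty()) {
            throw new IllegalArgumentException("no fields given");
        }
        String best = null;
        double bestLength = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : avgFieldLength.entrySet()) {
            if (e.getValue() > bestLength) {
                best = e.getKey();
                bestLength = e.getValue();
            }
        }
        return best;
    }

    // Approximates the document-level IDF of a term by the IDF computed
    // from its document frequency in the longest field (assumed approach).
    static double approximateIdf(long docCount,
                                 Map<String, Double> avgFieldLength,
                                 Map<String, Long> docFreqByField) {
        String field = longestField(avgFieldLength);
        long docFreq = docFreqByField.getOrDefault(field, 0L);
        return bm25Idf(docCount, docFreq);
    }
}
```

The bias mentioned above comes from the last method: a term that is rare in the longest field but common in a short field gets a higher IDF than a true document-level count would give it.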

On Thu, Feb 18, 2010 at 3:45 AM, Yuval Feinstein <yuvalf@answers.com> wrote:

> We could solve this by saying we only incorporate BM25F into Lucene.
> This is a field-based scoring method, so it saves us the need to deal with
> documents.
> Building on Joaquin's work, the extra parts needed IMO are:
> a. Support for storing average length per field during indexing. I think I
> saw some reference to this when Grant described the new features in Lucene
> 2.9. We need to store two numbers (say number of documents containing the
> field and average length) to support incremental indexing.
> b. Easy integration of BM25F similarity - default parameter values, working
> with regular Lucene class hierarchy.
> c. Support for all regular query types - PhraseQuery, FuzzyQuery etc. (We
> could do this incrementally, throwing an "UnsupportedOperationException" in
> the meantime).
> d. Some work on run-time efficiency, to be near the efficiency of the
> default scoring.
> I could do some of this work myself, but guidance from a Lucene scoring
> guru would be a great help.
> Thanks,
> Yuval
>
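
As an illustration of item (a) in the quoted list, the following is a rough sketch of why two numbers per field are enough for incremental indexing: given the count of documents containing the field and the current average length, the average can be updated for each new document, or combined across batches, without rescanning the index. The class and its methods are hypothetical and do not correspond to any existing Lucene API.

```java
// Hypothetical per-field statistics holder, not a Lucene class.
final class FieldLengthStats {
    private long docCount;     // number of documents containing the field
    private double avgLength;  // average field length over those documents

    // Incremental update when one more document with this field is indexed.
    void addDocument(int fieldLength) {
        docCount++;
        avgLength += (fieldLength - avgLength) / docCount;
    }

    // Combines statistics gathered separately (for example, from another
    // batch of documents) using a count-weighted average.
    void merge(FieldLengthStats other) {
        long total = docCount + other.docCount;
        if (total == 0) {
            return;
        }
        avgLength = (avgLength * docCount + other.avgLength * other.docCount) / total;
        docCount = total;
    }

    long docCount() {
        return docCount;
    }

    double averageLength() {
        return avgLength;
    }
}
```

Storing only the average would not be enough, since the weight of the existing documents is needed to fold in new ones. Handling deletions is left out of this sketch.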