lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yuval Feinstein <yuv...@answers.com>
Subject RE: BM25 Scoring Patch
Date Thu, 18 Feb 2010 08:45:18 GMT
We could solve this by saying we only incorporate BM25F into Lucene.
This is a field-based scoring method, so it saves us the need to deal with documents.
Building on Joaquin's work, the extra parts needed IMO are:
a. Support for storing average length per field during indexing. I think I saw some reference
to this
when Grant described the new features in Lucene 2.9. We need to store two numbers (say
number of documents containing the field and average length) to support incremental indexing.
b. Easy integration of BM25F similarity - default parameter values, working with regular Lucene
class hierarchy.
c. Support for all regular query types - PhraseQuery, FuzzyQuery etc. (We could do this incrementally,
throwing an "UnsupportedOperationException" in the meantime).
d. Some work on run-time efficiency, to be near the efficiency of the default scoring.
I could do some of this work myself, but guidance from a Lucene scoring guru would be a great
help.
Thanks,
Yuval

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Wednesday, February 17, 2010 6:47 PM
To: java-user@lucene.apache.org
Subject: Re: BM25 Scoring Patch

I tend to agree with you Marvin, you are right, the different scoring
mechanisms need different information available and this is the problem.

although last I checked, one hard part of BM25 rotates around fields versus
documents... e.g. BM25's IDF calculation.

but maybe this is just an extreme form of your example :)

On Wed, Feb 17, 2010 at 11:39 AM, Marvin Humphrey <marvin@rectangular.com>wrote:

> On Wed, Feb 17, 2010 at 10:31:19AM -0500, Robert Muir wrote:
> > yet if we don't do the hard work up front to make it easy to plug in
> things
> > like BM25, then no one will implement additional scoring formulas for
> > Lucene, we currently make it terribly difficult to do this.
>
> FWIW... Similarity and posting format spec are so closely tied that I'm
> considering linking them in Lucy.
>
>  Schema schema = new Schema();
>  FullTextType bm25Type = new FullTextType(new BM25Similarity());
>  schema.specField("content", bm25Type);
>  schema.specField("title", bm25Type);
>  StringType matchType = new StringType(new MatchSimilarity());
>  schema.specField("category", matchType);
>
> That way, custom scoring implementations can guarantee that they always
> have
> the posting information they need available to make their similarity
> judgments.  Similarity also becomes a more generalized notion, with the
> TF/IDF-specific functionality moving into a subclass.
>
> Maybe something similar could be made to work in Lucene.  Dunno how
> McCandless
> has things set up for spec'ing codecs on the flex branch.
>
> Marvin Humphrey
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
Robert Muir
rcmuir@gmail.com
Mime
View raw message