lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
Date Fri, 01 Apr 2011 13:17:05 GMT


Robert Muir commented on LUCENE-2959:

One thing that is not clear for me is why these limitations would not be a problem for BM25.
As I see it, the difference between the two methods is that BM25 simply computes tfs, idfs
and document length from the whole document – which, according to what you said, is not
available in Lucene. That's why I figured that a variant of BM25F would actually be more straightforward
to implement.

A variant sounds really interesting? I think you know better than me here; I just looked at
the original paper and thought to myself that implementing this "by the book" might not be
feasible for a while.

Robert, would you be so kind to have a look at my proposal? It can be found at
It's basically the same as what I sent to the mailing list. I wrote that I want to implement
BM25, BM25F and DFR ("the framework", I meant with one or two smoothing models), as well as
to convert the original scoring to the new framework. In light of the thread here, I guess
it would be better to modify these goals, perhaps by:

* deleting the conversion part?
* committing myself to BM25/BM25F only?
* explicitly stating that I want a higher level API based on the low-level one?

I think you can decide what you want to do? Obviously I would love to see all of it done :)

But it's your choice, I could see you going a couple of different ways:
* closer to your original proposal, you could still develop a flexible scoring API on top
of Similarity. Hey, all I did was move stuff from Scorer to Similarity really, which does
give flexibility, but it's probably not what an IR researcher would want (it's low-level and
confusing). So you could make a "SimpleSimilarity" or "EasySimilarity" or something that
presents a much simpler API (something closer to what Terrier/Indri present) on top of this,
for easily implementing ranking functions? I think this would be extremely valuable long-term:
who cares if we have a low-level flexible scoring API that only speed demons like, but IR
practitioners find confusing and hideous? Someone who is trying to experiment with an enhancement
to relevance likely doesn't care if their TREC run takes 30 seconds instead of 20 seconds
if the API is really easy and they aren't wasting time fighting with Lucene. If you go this
route, you could implement BM25, DFR, etc. as you suggested as examples of how to use this
API, and there would be more of a focus on API quality and simplicity instead of performance.
* or alternatively, you could refine your proposal to implement a really "production strength"
version of one of these scoring systems on top of the low-level API, that would ideally have
competitive performance/documentation/etc with Lucene's default scoring today. If you decide
to do this, then yes, I would definitely suggest picking only one, because I think it's a ton
of work as I listed above, and I think there would be more focus on practical things (some
probably being nuances of Lucene) and performance.
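
To make the first option a bit more concrete, here is a rough sketch of what such a simplified API could look like, with BM25 as the example ranking function. Everything in it is an assumption for illustration: the names BasicStats, EasySimilarity and BM25EasySimilarity, the choice of statistics, and the method signature are hypothetical and are not part of Lucene's API or of the attached patch. The BM25 formula is the common textbook form with a smoothed idf that stays non-negative, not necessarily the variant the project would end up with.

```java
// Hypothetical sketch only: none of these types exist in Lucene.

/** Collection-level and term-level statistics a ranking function typically needs. */
final class BasicStats {
    final long numDocs;        // number of documents in the collection
    final long docFreq;        // number of documents containing the term
    final double avgDocLength; // average document length, in tokens

    BasicStats(long numDocs, long docFreq, double avgDocLength) {
        this.numDocs = numDocs;
        this.docFreq = docFreq;
        this.avgDocLength = avgDocLength;
    }
}

/** Simplified scoring hook: a ranking function implements a single method. */
abstract class EasySimilarity {
    /** Score contribution of one term in one document. */
    abstract double score(BasicStats stats, double tf, double docLength);
}

/** BM25 in a common textbook form, with free parameters k1 and b. */
final class BM25EasySimilarity extends EasySimilarity {
    private final double k1; // controls term frequency saturation
    private final double b;  // controls document length normalization

    BM25EasySimilarity(double k1, double b) {
        this.k1 = k1;
        this.b = b;
    }

    @Override
    double score(BasicStats stats, double tf, double docLength) {
        // Smoothed idf; adding 1 inside the log keeps it non-negative.
        double idf = Math.log(1.0
                + (stats.numDocs - stats.docFreq + 0.5) / (stats.docFreq + 0.5));
        // Length normalization: longer-than-average documents are penalized.
        double norm = k1 * (1.0 - b + b * docLength / stats.avgDocLength);
        return idf * tf * (k1 + 1.0) / (tf + norm);
    }
}

class EasySimilarityDemo {
    public static void main(String[] args) {
        // Made-up statistics: 1000 documents, term occurs in 10 of them.
        BasicStats stats = new BasicStats(1000, 10, 100.0);
        EasySimilarity sim = new BM25EasySimilarity(1.2, 0.75);
        System.out.println(sim.score(stats, 3.0, 100.0));
    }
}
```

The point of the sketch is that someone experimenting with relevance would only write the body of score(); gathering the statistics and wiring them into the low-level Similarity would be the job of the layer underneath.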

> [GSoC] Implementing State of the Art Ranking for Lucene
> -------------------------------------------------------
>                 Key: LUCENE-2959
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Examples, Javadocs, Query/Scoring
>            Reporter: David Mark Nemeskey
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2959_mockdfr.patch, implementation_plan.pdf, proposal.pdf
> Lucene employs the Vector Space Model (VSM) to rank documents, which compares
> unfavorably to state of the art algorithms, such as BM25. Moreover, the architecture
> is tailored specifically to VSM, which makes the addition of new ranking functions a
> non-trivial task.
> This project aims to bring state of the art ranking methods to Lucene and to implement
> a query architecture with pluggable ranking functions.
