lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: problem in Lucene's ranking function
Date Wed, 05 May 2010 18:12:51 GMT
2010/5/5 José Ramón Pérez Agüera <jose.aguera@gmail.com>

> Hi Robert,
>
> thank you very much for your quick response, I have a couple of questions,
>
> did you read the papers that I mention in my e-mail?
>

Yes.


> do you think that Lucene ranking function could have this problem?
>
>
I know it does.


> My concern is not about how to implement different kinds of ranking
> functions for Lucene. I know that you are doing very nice work to
> implement a very flexible ranking framework for Lucene. My concern is
> about a bug which is independent of the ranking function that you are
> using, and which appears whenever some kind of saturation function is
> used in combination with a linear combination of fields for structured
> documents.
>

I think we might disagree here though. Must 'the combining of scores from
different fields' be hardcoded to one simple solution, or should it be
something that you can control yourself?

For example, it appears Terrier implements something different for this
problem: not the paper you referenced, but a different technique?
http://terrier.org/docs/v3.0/javadoc/org/terrier/matching/models/BM25F.html
But I don't quite understand all the subtleties involved... it seems in this
other paper there is still a linear combination, but you introduce
additional per-field parameters.
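
To make the point under discussion concrete, here is a minimal sketch of the
effect, as I understand it. It is not Lucene code and not the implementation
from either paper: it assumes a BM25-style saturation tf / (k1 + tf), leaves
out IDF and length normalization, and uses made-up field names, weights, and
documents.

```python
K1 = 1.2  # assumed saturation parameter, not taken from the thread


def saturate(tf, k1=K1):
    """BM25-style saturation: grows with tf but never reaches 1."""
    return tf / (k1 + tf)


def linear_combination(field_tfs, weights):
    """Saturate each field separately, then sum the weighted field scores."""
    return sum(weights[f] * saturate(tf) for f, tf in field_tfs.items())


def combined_tf(field_tfs, weights):
    """Sum the weighted term frequencies first, then saturate once."""
    pseudo_tf = sum(weights[f] * tf for f, tf in field_tfs.items())
    return saturate(pseudo_tf)


def score_doc(doc, query_terms, weights, per_term):
    """Sum the per-term scores; doc maps term -> {field: tf}."""
    return sum(per_term(doc.get(t, {}), weights) for t in query_terms)


# Hypothetical data: three equally weighted fields and a two-term query.
weights = {"title": 1.0, "body": 1.0, "anchor": 1.0}
query = ["lucene", "ranking"]

# doc_a repeats ONE query term across all three fields.
doc_a = {"lucene": {"title": 1, "body": 1, "anchor": 1}}
# doc_b matches BOTH query terms, each once, in a single field.
doc_b = {"lucene": {"body": 1}, "ranking": {"body": 1}}

for name, per_term in [("linear", linear_combination), ("combined", combined_tf)]:
    a = score_doc(doc_a, query, weights, per_term)
    b = score_doc(doc_b, query, weights, per_term)
    print(f"{name}: doc_a={a:.3f} doc_b={b:.3f}")
```

With per-field saturation, one term's contribution keeps growing with the
number of fields it appears in, so doc_a outranks doc_b. Saturating once over
the combined frequency keeps a single term's contribution bounded, and doc_b
(which matches both terms) comes out on top.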

The thing that makes me nervous about "hardcoding/changing" the way that
scores are combined across fields is that Lucene presents some strange
peculiarities, most notably the ability to use different scoring models for
different fields. This in fact already exists today: if you "omitTF" for one
field but not for another, you are using a different scoring model for the
two fields.
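
A rough sketch of what "different scoring models for different fields" means
in this case. It is an illustration only, not Lucene's API: it assumes the
default square-root tf for a normal field, treats an "omitTF" field as if
every match had frequency 1, and ignores IDF, norms, and boosts. The field
names and the FIELD_MODELS table are hypothetical.

```python
import math


def tf_score(freq):
    """Field that keeps term frequencies: sqrt(freq), as in the default tf."""
    return math.sqrt(freq) if freq > 0 else 0.0


def omit_tf_score(freq):
    """Field with term frequencies omitted: any match counts the same."""
    return 1.0 if freq > 0 else 0.0


# Hypothetical per-field choice of scoring model.
FIELD_MODELS = {"body": tf_score, "tags": omit_tf_score}


def score(field_freqs):
    """Sum the per-field scores, each computed by that field's own model."""
    return sum(FIELD_MODELS[f](freq) for f, freq in field_freqs.items())


print(score({"body": 9}), score({"body": 1}))  # frequency matters in "body"
print(score({"tags": 9}), score({"tags": 1}))  # frequency is ignored in "tags"
```

Any fixed rule for combining fields has to cope with per-field scores that
are already produced by different functions, as they are here.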


> Maybe I'm wrong, but if the linear combination of fields remains in
> lucene ranking function core, Lucene is never going to work properly
> to compute the score for structured documents.
>

I wouldn't say never. Maybe we will not get there in the first go, but
hopefully at least you will be able to do the things I mentioned above, such
as using different similarities for different fields, including ones that
are not supported today.


>
> I know how to solve the problem, and we have our own implementation of
> BM25F for Lucene whose performance is much better than standard
> Lucene's ranking function, but I think it would be useful for other
> Lucene users to know what the problem is in dealing with structured
> documents, and how to fix this problem for the next version,
> independently of which ranking function is finally implemented for Lucene.
>
>
It would be great if you could help us on that issue (I know the patch is a
bit out of date), to try to fix the scoring APIs, including perhaps thinking
about how to improve search across multiple fields for structured documents.

I would like to see the situation evolve away from "which ranking function
is implemented for Lucene" and toward having a variety of built-in functions
you can choose from.

So, I would rather it be more like Analyzers, where we have a variety of
high-quality implementations available, and you can make your own if you
must, but there is no real default.

-- 
Robert Muir
rcmuir@gmail.com
