Hi Robert,
Thank you very much for your quick response. I have a couple of questions:
did you read the papers that I mentioned in my e-mail?
Do you think that Lucene's ranking function could have this problem?
My concern is not about how to implement different kinds of ranking
functions for Lucene. I know that you are doing very nice work on a
very flexible ranking framework for Lucene. My concern is about a bug
that is independent of the ranking function you are using, and that
appears whenever some kind of saturation function is used together
with a linear combination of fields for structured documents.
Maybe I'm wrong, but if the linear combination of fields remains in
the core of Lucene's ranking function, Lucene is never going to
compute scores for structured documents properly.
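To make the effect concrete, here is a minimal sketch. It is a toy model, not Lucene's actual scoring code: it keeps only the sqrt(tf) saturation and the per-field boosts, and ignores idf, length norms, coord and query normalization. The class and method names are hypothetical.

```java
// Toy model only: keeps sqrt(tf) saturation and per-field boosts,
// and ignores idf, length norms, coord and query normalization.
final class PerFieldSaturationDemo {

    // Per-field scoring: saturate each field on its own, then add the
    // boosted field scores together (linear combination of fields).
    static double perFieldScore(int[] tfPerField, double[] boostPerField) {
        double score = 0.0;
        for (int i = 0; i < tfPerField.length; i++) {
            score += boostPerField[i] * Math.sqrt(tfPerField[i]);
        }
        return score;
    }

    public static void main(String[] args) {
        double[] boosts = {1.0, 1.0, 1.0, 1.0};

        // Four occurrences of one term, all in the same field.
        double oneField = perFieldScore(new int[] {4, 0, 0, 0}, boosts);
        // The same four occurrences, one in each of four fields.
        double fourFields = perFieldScore(new int[] {1, 1, 1, 1}, boosts);

        System.out.println("one field:   " + oneField);   // 2.0
        System.out.println("four fields: " + fourFields); // 4.0
    }
}
```

The same four occurrences score twice as high when they are spread over four fields, because each field restarts the saturation curve from zero.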
I know how to solve the problem, and we have our own implementation of
BM25F for Lucene whose performance is much better than that of Lucene's
standard ranking function. But I think it would be useful for other
Lucene users to know what the problem with structured documents is,
and how to fix it for the next version, regardless of which ranking
function is finally implemented for Lucene.
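For illustration, the general idea behind BM25F-style scoring is to add up the boosted term frequencies across fields first and apply the saturation only once. The sketch below is not the BM25F implementation mentioned above. It is a simplified illustration that assumes no idf and no field-length normalization; the class name, method name and k1 = 1.2 are illustrative assumptions.

```java
// Simplified BM25F-style sketch: no idf, no field-length normalization.
final class CombinedSaturationSketch {

    // Add up the boosted term frequencies over all fields first, then
    // apply a single BM25-style saturation tf / (k1 + tf) to the total.
    static double termScore(int[] tfPerField, double[] boostPerField, double k1) {
        double weightedTf = 0.0;
        for (int i = 0; i < tfPerField.length; i++) {
            weightedTf += boostPerField[i] * tfPerField[i];
        }
        return weightedTf / (k1 + weightedTf);
    }

    public static void main(String[] args) {
        double[] boosts = {1.0, 1.0, 1.0, 1.0};
        double k1 = 1.2; // illustrative value, not a recommendation

        // Document X: one query term, once in each of four fields.
        double docX = termScore(new int[] {1, 1, 1, 1}, boosts, k1);
        // Document Y: two query terms, each once in a single field.
        double docY = termScore(new int[] {1, 0, 0, 0}, boosts, k1)
                    + termScore(new int[] {1, 0, 0, 0}, boosts, k1);

        System.out.println("X (one term, four fields): " + docX);
        System.out.println("Y (two terms, one field):  " + docY);
    }
}
```

With these toy numbers, Y (two query terms matched) scores higher than X (one query term matched in many fields), and spreading a term's occurrences over several equally boosted fields no longer changes its score.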
jose
On Wed, May 5, 2010 at 1:38 PM, Robert Muir wrote:
> José, you might want to watch LUCENE-2392.
>
> In this issue, we are proposing adding additional flexibility to the scoring
> mechanism including:
> * controlling scoring on a per-field basis
> * the ability to compute and use aggregate statistics (average field length,
> total TF across all docs)
> * fine-grained calculation of the score: essentially at the end of the day
> if you want, you can implement score() in your Similarity and do whatever
> you want, so things like tf() and idf() as methods "go away" in that they
> might not even make sense for your scorer. So, SimilarityProvider in this
> model gets the flexibility of Scorer hopefully without the hassles.
>
> As far as combining scores across fields, I do not see why
> 2010/5/5 José Ramón Pérez Agüera
>
>> Hi all,
>>
>> We have realized that there is a bug in Lucene's ranking function.
>> Most ranking functions use a non-linear method to saturate the
>> computation of the frequencies.
>> This is due to the fact that the information gained on observing a
>> term the first time is greater than the information gained on
>> subsequently seeing the same term. The non-linear method can be as
>> simple as a logarithmic or a square-root function, or a more complex
>> parameter-based approach like BM25's k1 parameter. S. Robertson 2004
>> http://portal.acm.org/citation.cfm?id=1031181 has described the
>> dangers of combining scores from different document fields and the
>> most typical errors made when ranking functions are modified to
>> consider the structure of the documents.
>>
>> To rank these structured documents, Lucene combines the scores from
>> document fields. The method used by Lucene to compute the score of a
>> structured document is based on the linear combination of the scores
>> for each field of the document.
>>
>> Lucene's ranking function uses the square root of the term frequency
>> as the non-linear method to saturate the computation of the
>> frequencies. However, the linear combination of per-field scores
>> that Lucene uses to compute the score for the whole document breaks
>> the saturation effect, since the fields' boost factors are applied
>> after the non-linear methods are applied. The consequence is that a
>> document matching a single query term over several fields could score
>> much higher than a document matching several query terms in one field
>> only, which is not a good way to compute relevance and tends to hurt
>> ranking performance dramatically.
>>
>> We have written a paper where this problem is described and some
>> experiments are carried out to show the effect in Lucene performance.
>> http://km.aifb.kit.edu/ws/semsearch10/Files/bm25f.pdf
>>
>> Would it be possible to fix this problem so that Lucene works
>> properly for structured documents?
>>
>> thank you very much in advance
>>
>> jose
>>
>> --
>> Jose R. Pérez-Agüera
>>
>> Clinical Assistant Professor
>> Metadata Research Center
>> School of Information and Library Science
>> University of North Carolina at Chapel Hill
>> email: jaguera@email.unc.edu
>> Web page: http://www.unc.edu/~jaguera/
>> MRC website: http://ils.unc.edu/mrc/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>