lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: problem in Lucene's ranking function
Date Wed, 05 May 2010 17:38:36 GMT
José, you might want to watch LUCENE-2392.

In this issue, we are proposing adding additional flexibility to the scoring
mechanism including:
* controlling scoring on a per-field basis
* the ability to compute and use aggregate statistics (average field length,
total TF across all docs)
* fine-grained calculation of the score: if you want, you can implement
score() in your Similarity and do whatever you want, so methods like tf()
and idf() "go away" in that they might not even make sense for your scorer.
So, SimilarityProvider in this model gets the flexibility of Scorer,
hopefully without the hassles.
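
As a rough illustration of why aggregate statistics such as average field length matter, here is a sketch of a textbook BM25 term weight, which cannot be computed without the collection-wide average. The function name, argument names, and the defaults k1=1.2 and b=0.75 are assumptions made for this example; none of this is code from LUCENE-2392 or Lucene's API.

```python
import math


def bm25_term_score(tf, field_len, avg_field_len, doc_freq, num_docs,
                    k1=1.2, b=0.75):
    """Textbook BM25 weight of one term in one field (illustrative only)."""
    # IDF with +1 inside the log so the value never goes negative.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: this is where the aggregate statistic
    # (average field length over all documents) is required.
    norm = k1 * (1.0 - b + b * field_len / avg_field_len)
    # tf / (tf + norm) saturates: each extra occurrence adds less.
    return idf * tf * (k1 + 1.0) / (tf + norm)
```

With these assumed defaults, a field shorter than average scores higher than a longer one at the same term frequency, and the gain from each additional occurrence shrinks.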

As far as combining scores across fields, I do not see why
2010/5/5 José Ramón Pérez Agüera <jose.aguera@gmail.com>

> Hi all,
>
> We have realized that there is a bug in Lucene's ranking function. Most
> ranking functions use a non-linear method to saturate the computation
> of the frequencies.
> This is due to the fact that the information gained on observing a
> term the first time is greater than the information gained on
> subsequently seeing the same term. The non-linear method can be as
> simple as a logarithmic or a square-root function, or a more complex
> parameter-based approach like BM25's k1 parameter. S. Robertson (2004)
> http://portal.acm.org/citation.cfm?id=1031181 has described the
> dangers of combining scores from different document fields and the
> most typical errors made when ranking functions are modified to
> consider the structure of the documents.
>
> To rank these structured documents, Lucene combines the scores from
> document fields. The method used by Lucene to compute the score of a
> structured document is based on the linear combination of the scores
> for each field of the document.
>
> Lucene's ranking function uses the square root of the term frequency
> as its non-linear method to saturate the computation of the
> frequencies. However, the score for the whole document is a linear
> combination of the per-field scores, and this breaks the saturation
> effect, since the fields' boost factors are applied after the
> non-linear method has been applied. The consequence is that a document
> matching a single query term over several fields could score much
> higher than a document matching several query terms in one field only,
> which is not a good way to compute relevance and tends to hurt
> ranking function performance dramatically.
>
> We have written a paper where this problem is described and some
> experiments are carried out to show the effect in Lucene performance.
> http://km.aifb.kit.edu/ws/semsearch10/Files/bm25f.pdf
>
> Would it be possible to fix this problem so that Lucene works
> properly for structured documents?
>
> thank you very much in advance
>
> jose
>
> --
> Jose R. Pérez-Agüera
>
> Clinical Assistant Professor
> Metadata Research Center
> School of Information and Library Science
> University of North Carolina at Chapel Hill
> email: jaguera@email.unc.edu
> Web page: http://www.unc.edu/~jaguera/
> MRC website: http://ils.unc.edu/mrc/
>
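
The saturation problem described in the quoted message can be made concrete with a small sketch. It compares two ways of scoring one term across fields: saturating each field with a square root and then summing (the linear combination the message describes), versus summing the boosted frequencies first and saturating once (the idea behind BM25F-style scoring). This is a deliberate simplification that ignores idf, length norms, and every other factor; all names and the sample documents are hypothetical, and it is not Lucene's scoring code.

```python
import math


def per_field_score(field_tfs, boosts):
    # Saturate inside each field, then combine linearly across fields.
    return sum(boosts[f] * math.sqrt(tf) for f, tf in field_tfs.items())


def combined_score(field_tfs, boosts):
    # Combine boosted frequencies across fields first, then saturate once.
    return math.sqrt(sum(boosts[f] * tf for f, tf in field_tfs.items()))


def score_doc(doc, query_terms, boosts, term_scorer):
    # doc maps term -> {field: term frequency}; missing terms contribute 0.
    return sum(term_scorer(doc.get(t, {}), boosts) for t in query_terms)


boosts = {"title": 1.0, "abstract": 1.0, "body": 1.0}
query = ["lucene", "ranking"]

# One query term, repeated across three fields.
doc_a = {"lucene": {"title": 1, "abstract": 1, "body": 1}}
# Both query terms, in a single field.
doc_b = {"lucene": {"body": 1}, "ranking": {"body": 1}}

for name, scorer in [("per-field", per_field_score),
                     ("combined", combined_score)]:
    print(name,
          score_doc(doc_a, query, boosts, scorer),
          score_doc(doc_b, query, boosts, scorer))
```

Under these assumptions the per-field scheme ranks doc_a (3.0) above doc_b (2.0), while combining before saturating ranks doc_b (2.0) above doc_a (about 1.73), which matches the behavior the quoted message argues for.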


-- 
Robert Muir
rcmuir@gmail.com
