José, you might want to watch LUCENE-2392.
In that issue, we are proposing to make the scoring mechanism more
flexible, including:
* controlling scoring on a per-field basis
* the ability to compute and use aggregate statistics (average field length,
total TF across all docs)
* fine-grained calculation of the score: essentially, if you want, you
can implement score() in your Similarity and do whatever you want, so
methods like tf() and idf() "go away" in that they might not even make
sense for your scorer. So SimilarityProvider in this model gets the
flexibility of Scorer, hopefully without the hassles.
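
To make the per-field point concrete, here is a rough sketch in plain
Java. It is not the LUCENE-2392 API: FieldStats, saturateThenCombine and
combineThenSaturate are made-up names, and the square-root saturation and
simple length normalization are assumptions chosen only for illustration.
It compares saturating each field's term frequency before the boosted sum
with summing the boosted, length-normalized frequencies first and
saturating once.

```java
import java.util.Map;

public class PerFieldScoringSketch {

    /** Hypothetical per-field boost and aggregate statistic (not a Lucene class). */
    static final class FieldStats {
        final double boost;          // weight given to this field
        final double avgFieldLength; // average length of this field across all docs

        FieldStats(double boost, double avgFieldLength) {
            this.boost = boost;
            this.avgFieldLength = avgFieldLength;
        }
    }

    /** Saturate each field's term frequency with sqrt, then sum the boosted results. */
    static double saturateThenCombine(Map<String, Integer> tfByField,
                                      Map<String, FieldStats> stats) {
        double score = 0.0;
        for (Map.Entry<String, Integer> e : tfByField.entrySet()) {
            score += stats.get(e.getKey()).boost * Math.sqrt(e.getValue());
        }
        return score;
    }

    /** Sum boosted, length-normalized frequencies first, then saturate once. */
    static double combineThenSaturate(Map<String, Integer> tfByField,
                                      Map<String, Integer> lengthByField,
                                      Map<String, FieldStats> stats) {
        double weightedTf = 0.0;
        for (Map.Entry<String, Integer> e : tfByField.entrySet()) {
            FieldStats s = stats.get(e.getKey());
            // Ratio of this field's length to the average length of that field.
            double lengthNorm = lengthByField.get(e.getKey()) / s.avgFieldLength;
            weightedTf += s.boost * e.getValue() / lengthNorm;
        }
        return Math.sqrt(weightedTf);
    }

    public static void main(String[] args) {
        Map<String, FieldStats> stats = Map.of(
                "title", new FieldStats(1.0, 10.0),
                "subject", new FieldStats(1.0, 10.0),
                "keywords", new FieldStats(1.0, 10.0),
                "body", new FieldStats(1.0, 10.0));

        // Same term, four occurrences in total: once in each of four fields...
        Map<String, Integer> spreadTf =
                Map.of("title", 1, "subject", 1, "keywords", 1, "body", 1);
        Map<String, Integer> spreadLen =
                Map.of("title", 10, "subject", 10, "keywords", 10, "body", 10);
        // ...versus four times in a single field.
        Map<String, Integer> singleTf = Map.of("body", 4);
        Map<String, Integer> singleLen = Map.of("body", 10);

        System.out.println(saturateThenCombine(spreadTf, stats));            // 4.0
        System.out.println(saturateThenCombine(singleTf, stats));            // 2.0
        System.out.println(combineThenSaturate(spreadTf, spreadLen, stats)); // 2.0
        System.out.println(combineThenSaturate(singleTf, singleLen, stats)); // 2.0
    }
}
```

With equal boosts and average-length fields, the first method scores the
spread-out document twice as high as the single-field one for the same
four occurrences, while the second scores them the same. Being able to
implement score() yourself is what would let you pick either behavior.
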
As far as combining scores across fields, I do not see why
2010/5/5 José Ramón Pérez Agüera <jose.aguera@gmail.com>
> Hi all,
>
> We have realized that there is a bug in Lucene's ranking function. Most
> ranking functions use a nonlinear method to saturate the computation
> of the frequencies.
> This is due to the fact that the information gained on observing a
> term the first time is greater than the information gained on
> subsequently seeing the same term. The nonlinear method can be as
> simple as a logarithmic or a square-root function, or a more complex
> parameter-based approach like the BM25 k1 parameter. S. Robertson (2004)
> http://portal.acm.org/citation.cfm?id=1031181 has described the
> dangers of combining scores from different document fields and the
> most typical errors made when ranking functions are modified to
> consider the structure of the documents.
>
> To rank these structured documents, Lucene combines the scores from
> document fields. The method used by Lucene to compute the score of a
> structured document is based on the linear combination of the scores
> for each field of the document.
>
> Lucene's ranking function uses the square root of the term frequency
> to implement the nonlinear method to saturate the computation of the
> frequencies, but the linear combination of the scores by field that
> Lucene implements to compute the score for the whole document breaks
> the saturation effect, since the fields' boost factors are applied
> after the nonlinear methods are used. The consequence is that a document
> matching a single query term over several fields could score much
> higher than a document matching several query terms in one field only,
> which is not a good way to compute relevance and tends to hurt
> ranking function performance dramatically.
>
> We have written a paper in which this problem is described and some
> experiments are carried out to show the effect on Lucene's performance.
> http://km.aifb.kit.edu/ws/semsearch10/Files/bm25f.pdf
>
> Would it be possible to fix this problem so that Lucene works
> properly for structured documents?
>
> thank you very much in advance
>
> jose
>
> 
> Jose R. Pérez-Agüera
>
> Clinical Assistant Professor
> Metadata Research Center
> School of Information and Library Science
> University of North Carolina at Chapel Hill
> email: jaguera@email.unc.edu
> Web page: http://www.unc.edu/~jaguera/
> MRC website: http://ils.unc.edu/mrc/

Robert Muir
rcmuir@gmail.com
