lucene-java-user mailing list archives

From Ian Soboroff <ian.sobor...@nist.gov>
Subject Re: Question about scoring normalisation
Date Mon, 07 Nov 2005 18:42:40 GMT
"Karl Koch" <TheRanger@gmx.net> writes:

> I am not sure if I know exactly what pivoted normalisation is. I can tell
> you what I do, in the meantime I will have a look to your paper and I hope
> that we can discuss this issue further.

Short answer on pivoted document length normalization.  You'll notice
that the Lucene scoring function includes a normalization for document
length.  This is because, in general, just using tf and idf will
result in a bias towards long documents, which contain more terms.

The standard cosine normalization controls for this, but too much.
If you plot the probability of retrieval and the probability of
relevance against document length (using a test collection), you can
see that cosine is too biased towards short documents.
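
A rough sketch of how one might compute those two curves, assuming you
already have, for each (query, document) pair in a test collection,
the document length, a relevance judgment, and whether the system
retrieved it. The function name and the input layout are made up for
illustration; nothing here is Lucene API.

```python
def length_bin_curves(docs, num_bins):
    """Bin documents by length and compare relevance with retrieval.

    docs: list of (length, is_relevant, is_retrieved) tuples
          (hypothetical input layout, one tuple per query/document pair).
    Returns one (representative_length, p_relevant, p_retrieved) tuple
    per bin, where each probability is the bin's share of all relevant
    (or all retrieved) pairs.
    """
    if not docs or num_bins < 1:
        return []
    ordered = sorted(docs, key=lambda d: d[0])
    total_rel = sum(1 for d in ordered if d[1])
    total_ret = sum(1 for d in ordered if d[2])
    bin_size = -(-len(ordered) // num_bins)  # ceiling division
    curves = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        # chunk is already sorted by length; take its middle element
        rep_len = chunk[len(chunk) // 2][0]
        p_rel = sum(1 for d in chunk if d[1]) / total_rel if total_rel else 0.0
        p_ret = sum(1 for d in chunk if d[2]) / total_ret if total_ret else 0.0
        curves.append((rep_len, p_rel, p_ret))
    return curves
```

If the retrieval share sits above the relevance share in the
short-document bins and below it in the long-document bins, the
normalization is favouring short documents.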

Pivoted normalization learns a scaling factor to correct for this.  In
the original formulation (Singhal et al., SIGIR '96) the length was
based on words in the document, but later the byte length was used.
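
For concreteness, the pivoted norm is usually written as
(1 - slope) * pivot + slope * length, where the pivot is typically the
average document length in the collection. Below is a minimal sketch
of that formula. The function names are mine and the slope value in
the comments is a placeholder, not a recommendation; this is not how
Lucene itself computes its length norm.

```python
def pivoted_norm(doc_length, pivot, slope):
    """Pivoted length normalization factor.

    doc_length: length of the document (words, bytes, etc.)
    pivot:      pivot length, commonly the collection's average length
    slope:      value in (0, 1]; slope = 1.0 reduces to plain length
                normalization. It should be tuned on your own collection.
    """
    return (1.0 - slope) * pivot + slope * doc_length


def normalized_score(raw_score, doc_length, pivot, slope):
    """Divide an unnormalized tf-idf style score by the pivoted norm."""
    return raw_score / pivoted_norm(doc_length, pivot, slope)
```

A document exactly at the pivot is normalized by its own length.
Shorter documents get a norm larger than their length and longer ones
a norm smaller than their length, which pulls back the advantage that
the unpivoted normalization gives to short documents.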

Occasionally, you will see people blindly using pivoted-normalization
constants from some paper on their own collection without actually
trying to measure what they should be.  This is likely to screw things
up.

Ian


