lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: How to pull document scoring values
Date Wed, 29 Sep 2004 07:17:21 GMT
Zia,

On Tuesday 28 September 2004 21:22, you wrote:
> Hi,
>
> I'm trying to learn the Scoring mechanism of Lucene. I want to fetch
> each parameter value individually as they are collectively dumped out by
> Explanation. I've managed to pull out TF and IDF values using
> DefaultSimilarity and FilterIndexReader, but not sure from where to get
> the fieldNorm and queryNorm from.

The norms are here:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#norms(java.lang.String)
The resulting array is indexed by the document number for the IndexReader.
With the default similarity, each norm is the inverse square root of the number of indexed
terms in the 
document field. However, there are only 8 bits available to encode this value, so it's quite
rough.

The default queryNorm is here:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/DefaultSimilarity.html#queryNorm(float)
There is an explanation of the scoring in the javadocs of Similarity.
There has been some discussion on an idf factor that was missing from this documentation,

I don't know whether the docs have been adapted for this.

> Also is there any reference about how normalisation has been
> implemented?

See above, DefaultSimilarity is the default implementation of the Similarity interface.
queryNorm() takes a sumOfSquaredWeights, where the weights are the term weights
from the query. It returns the square root.

It may be that the sum of squared weights should be a sum of square rooted weights
and that queryNorm should return a square then.
I posted this on lucene-user on 20 September:
http://issues.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgNo=10023

It's only a normalisation, so it doesn't affect the order of the search results much.
Taking the square roots of the  query term weights would have
the query weights directly apllied to the the query term density in the document field,
whereas now the weights seem to be applied to the square root of the density.
The density value is an approximation, see above for the rough field norms.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message