lucene-java-user mailing list archives

From Zia Syed <zia.s...@smartweb.rgu.ac.uk>
Subject Re: How to pull document scoring values
Date Wed, 29 Sep 2004 13:41:26 GMT
Hi Paul,
Thanks for your detailed reply! It really helped a lot.
However, I am seeing some conflicting values.

For one of the documents in the result set, when I use

IndexReader fir = IndexReader.open("index");
byte[] fNorm = fir.norms("Body");  // one encoded norm byte per document
System.out.println("FNorm: " + fNorm[306]);
Document d = fir.document(306);
Field f = d.getField("Body");

System.out.println("Body: " + f.stringValue());

This prints an fNorm of 113, whereas the total number of terms
(including stop words) in this field of the selected document is 42. In
the explanation, fieldNorm(field=Body, doc=306) is 0.1562, which
corresponds to approximately 41 terms for that field in that document.
So the explanation value makes sense against the real data, provided all
stop words such as "to", "it", "the", etc. are counted.
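
The two figures above can be compared with a small sketch. It assumes the
raw byte returned by norms() is an encoded float with a 3-bit mantissa and
a 5-bit exponent, which is my understanding of what
Similarity.decodeNorm(byte) does in this Lucene version; the class and
method names below are hypothetical and the bit layout is reproduced by
hand rather than taken from the library. It also assumes no field or
document boost was applied at indexing time. Under those assumptions the
byte 113 decodes to 0.15625, and 1 / 0.15625^2 is about 40.96 terms.

```java
// Sketch only: mirrors the byte-to-float layout that Lucene's
// Similarity.decodeNorm(byte) is assumed to use (3-bit mantissa,
// 5-bit exponent). Not the library's own code.
final class NormDecodeSketch {

    /** Decodes one stored norm byte into the float it stands for. */
    static float decodeNorm(byte b) {
        if (b == 0) {
            return 0.0f;                      // zero is stored as zero
        }
        int mantissa = b & 7;                 // low 3 bits
        int exponent = (b >> 3) & 31;         // next 5 bits
        int bits = ((exponent + (63 - 15)) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }

    /**
     * Inverts norm = 1 / sqrt(numTerms), the default length
     * normalisation, assuming no boosts were applied.
     */
    static float impliedTermCount(float norm) {
        return 1.0f / (norm * norm);
    }

    public static void main(String[] args) {
        byte raw = (byte) 113;                // value printed for doc 306
        float norm = decodeNorm(raw);
        System.out.println("decoded norm:  " + norm);
        System.out.println("implied terms: " + impliedTermCount(norm));
    }
}
```

If this reading is right, the raw byte and the explanation value are the
same number in two representations, and the small gap between 41 and 42
terms would come from the 8-bit rounding.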

So, my questions are:
> Am I getting the norm values from the right place?
> Is there any way to find out the number of indexed terms for each
document?

Please advise!

Thanks,

Zia



On Wed, 2004-09-29 at 08:17, Paul Elschot wrote:
> Zia,
> 
> On Tuesday 28 September 2004 21:22, you wrote:
> > Hi,
> >
> > I'm trying to learn the Scoring mechanism of Lucene. I want to fetch
> > each parameter value individually as they are collectively dumped out by
> > Explanation. I've managed to pull out TF and IDF values using
> > DefaultSimilarity and FilterIndexReader, but not sure from where to get
> > the fieldNorm and queryNorm from.
> 
> The norms are here:
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#norms(java.lang.String)
> The resulting array is indexed by the document number for the IndexReader.
> With the default similarity, each norm is the inverse square root of the number of indexed
> terms in the document field. However, there are only 8 bits available to encode this value,
> so it's quite rough.
> 
> The default queryNorm is here:
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/DefaultSimilarity.html#queryNorm(float)
> There is an explanation of the scoring in the javadocs of Similarity.
> There has been some discussion on an idf factor that was missing from this documentation;
> I don't know whether the docs have been adapted for this.
> 
> > Also is there any reference about how normalisation has been
> > implemented?
> 
> See above, DefaultSimilarity is the default implementation of the Similarity interface.
> queryNorm() takes a sumOfSquaredWeights, where the weights are the term weights
> from the query. It returns the inverse of the square root.
> 
> It may be that the sum of squared weights should be a sum of square rooted weights
> and that queryNorm should return a square then.
> I posted this on lucene-user on 20 September:
> http://issues.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgNo=10023
> 
> It's only a normalisation, so it doesn't affect the order of the search results much.
> Taking the square roots of the query term weights would have
> the query weights directly applied to the query term density in the document field,
> whereas now the weights seem to be applied to the square root of the density.
> The density value is an approximation, see above for the rough field norms.
> 
> Regards,
> Paul Elschot
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
-- 
Zia Syed <zia.syed@smartweb.rgu.ac.uk>
Smartweb Research Center, Robert Gordon University

