lucene-java-user mailing list archives

From "Karl Koch" <TheRan...@gmx.net>
Subject Lucene scoring: Term frequency normalisation
Date Tue, 12 Dec 2006 10:23:41 GMT
Hi,

I have a question about the current Lucene scoring algorithm. In this algorithm, the
term frequency factor is calculated as the square root of the number of times the term occurs
in the document, as described in

http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_tf

Having read a number of IR papers and books, I am familiar with the use of a logarithm
to normalise term frequency, so that very high term frequencies do not have too much
effect on the score.
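
To make the comparison concrete, here is a minimal sketch that prints both damping functions side by side. It is not Lucene code: the class name TfDamping and both method names are made up for illustration, and 1 + ln(freq) is only one common textbook form of log normalisation, assumed here.

```java
// Hypothetical illustration class; not part of Lucene.
class TfDamping {

    // Square-root damping, as described in the Similarity javadoc linked above.
    static double sqrtTf(int freq) {
        return Math.sqrt(freq);
    }

    // Assumed textbook variant of log damping: 1 + ln(freq), and 0 when the term is absent.
    static double logTf(int freq) {
        return freq > 0 ? 1.0 + Math.log(freq) : 0.0;
    }

    public static void main(String[] args) {
        int[] freqs = {1, 2, 5, 10, 50, 100, 1000};
        System.out.printf("%6s %10s %10s%n", "freq", "sqrt", "1+ln");
        for (int f : freqs) {
            System.out.printf("%6d %10.3f %10.3f%n", f, sqrtTf(f), logTf(f));
        }
    }
}
```

Both functions return 1 for a single occurrence, but the square root keeps growing much faster: a term occurring 100 times gets a factor of 10 under the square root and roughly 5.6 under this log variant.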

However, what exactly is the advantage of using the square root instead of the log? Is there a
scientific reason behind this? Does anybody know of a paper on this issue, or any source of
empirical evidence that this works better than the log? Is there perhaps another discussion
thread on this list that I have not seen?

Thank you in advance,
Karl 

