lucene-java-user mailing list archives

From python2...@yahoo.com
Subject Scoring function in LMDirichletSimilarity Class
Date Fri, 29 Mar 2013 21:21:04 GMT
Hi,
 
Can anyone help me understand the scoring function in the LMDirichletSimilarity class? 
 
The scoring function in LMDirichletSimilarity is shown below:
-------------------------------------------------------------------------------------------
float score = stats.getTotalBoost() * (float)(
    Math.log(1 + freq / (mu * ((LMStats)stats).getCollectionProbability())) +
    Math.log(mu / (docLen + mu))
);
-------------------------------------------------------------------------------------------
 
The formula for the expression inside the (float) cast above is
log[ (tf + mu * P(w|C)) / ((docLen + mu) * P(w|C)) ], which, in terms of scoring, should be equivalent to
-------------------------------------------------------------------------------------------
return score = (float) ( (freq + mu * ((LMStats)stats).getCollectionProbability()) / (docLen + mu) );
-------------------------------------------------------------------------------------------
which follows the textbook/paper formula exactly, because the division by P(w|C) is the same
for all documents. However, I'm getting much worse results with the second piece of code.
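
To make the comparison concrete, here is a minimal standalone sketch that evaluates the per-term expressions from the two snippets above side by side. It does not use Lucene at all. The class name DirichletCheck, the method names, and the input values (tf, docLen, mu, p) are assumptions chosen only for illustration, and the boost factor is left out. It can confirm that the two-log form and the single-log form are the same quantity; it also shows that the second snippet, which drops both the log and the division by P(w|C), produces a different per-term value.

```java
public class DirichletCheck {

    // Form used in the first snippet (without the boost):
    // log(1 + tf / (mu * p)) + log(mu / (docLen + mu))
    static double twoLogForm(double tf, double docLen, double mu, double p) {
        return Math.log(1 + tf / (mu * p)) + Math.log(mu / (docLen + mu));
    }

    // The same quantity written as a single log:
    // log[ (tf + mu * p) / ((docLen + mu) * p) ]
    static double singleLogForm(double tf, double docLen, double mu, double p) {
        return Math.log((tf + mu * p) / ((docLen + mu) * p));
    }

    // Form used in the second snippet: the smoothed probability itself,
    // with no log and no division by the collection probability p.
    static double smoothedProbability(double tf, double docLen, double mu, double p) {
        return (tf + mu * p) / (docLen + mu);
    }

    public static void main(String[] args) {
        // Hypothetical inputs, not taken from any real index.
        double tf = 3, docLen = 120, mu = 2000, p = 1e-4;

        System.out.println("two-log form:         " + twoLogForm(tf, docLen, mu, p));
        System.out.println("single-log form:      " + singleLogForm(tf, docLen, mu, p));
        System.out.println("smoothed probability: " + smoothedProbability(tf, docLen, mu, p));
    }
}
```

Since a query's score is accumulated over its terms, it may be worth checking how these per-term values combine for multi-term queries.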

Can anyone help explain why this is happening? Am I missing something about the scoring?
 
 
Thanks,
Dong