Hello Lucene users,
in the past, I asked several times about the scoring used in Lucene 1.2 (which might still be valid in current Lucene versions). Back then I asked purely out of curiosity, but now I need the answer in order to write proper documentation.
At that time, I found a high-level answer with the kind help of Joaquin Delgado, whose posting ( http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200609.mbox/%3C45043CC5.3080103@oracle.com%3E ) pointed me to this mailing list contribution ( http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200307.mbox/%3C000501c34ced$1f3b5c90$0500a8c0@ki%3E ).
According to these sources, the Lucene scoring formula in version 1.2 is:
score(q,d) = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
boost_t) * coord_q_d
where
* score (q,d) : score for document d given query q
* sum_t : sum for all terms t in q
* tf_q : the square root of the frequency of t in q
* tf_d : the square root of the frequency of t in d
* idf_t : log(numDocs/(docFreq_t+1)) + 1.0
* numDocs : number of documents in index
* docFreq_t : number of documents containing t
* norm_q : sqrt(sum_t((tf_q*idf_t)^2))
* norm_d_t : square root of number of tokens in d in the same field
as t
* boost_t : the user-specified boost for term t
* coord_q_d : number of terms in both query and document / number of
terms in query. The coordination factor gives an AND-like boost to
documents that contain, e.g., all three terms in a three-word
query over those that contain just two of the words.
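To check my own reading of the formula, here is a minimal Python sketch of it. It is not Lucene code: the function names, the single-field assumption, and the toy inputs are my own, and I read the idf definition as log(numDocs/(docFreq_t+1)) + 1.0.

```python
import math
from collections import Counter


def idf(num_docs, doc_freq):
    # idf_t = log(numDocs / (docFreq_t + 1)) + 1.0 (my reading of the formula)
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def score(query_terms, doc_terms, doc_freqs, num_docs, boosts=None):
    """Score one document (a list of tokens from a single field) for a query.

    doc_freqs maps a term to the number of documents containing it.
    boosts maps a term to its user-specified boost (default 1.0).
    """
    if not query_terms or not doc_terms:
        return 0.0
    boosts = boosts or {}
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_terms)

    # Query-side weights: tf_q * idf_t, with tf_q = sqrt(frequency in query)
    q_weights = {
        t: math.sqrt(c) * idf(num_docs, doc_freqs.get(t, 0))
        for t, c in q_counts.items()
    }
    # norm_q = sqrt(sum_t((tf_q * idf_t)^2))
    norm_q = math.sqrt(sum(w * w for w in q_weights.values()))
    # norm_d_t = sqrt(number of tokens in the document's field)
    norm_d = math.sqrt(len(doc_terms))

    total = 0.0
    matched = 0
    for t, w_q in q_weights.items():
        if d_counts[t] == 0:
            continue  # term absent from the document contributes nothing
        matched += 1
        tf_d = math.sqrt(d_counts[t])
        idf_t = idf(num_docs, doc_freqs.get(t, 0))
        total += (w_q / norm_q) * (tf_d * idf_t / norm_d) * boosts.get(t, 1.0)

    # coord_q_d = distinct query terms found in the document / distinct query terms
    coord = matched / len(q_counts)
    return total * coord
```

With a two-term query, a document containing both terms once should outscore an equally long document containing only one of them twice, which is the AND-like effect of coord_q_d described above.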
This will now allow me to include the scoring formula in the documentation, which will be of great help. For verification, I have attached the formula as a picture generated from LaTeX. Please let me know if you find any mistake or if you think the formula could be simplified (I am not a mathematician...).
For a deeper understanding, I would like to ask a few further questions. I am not an expert in Information Retrieval, so I hope my questions are not embarrassingly basic. I read the paper by Erica Chisholm and Tamara G. Kolda (http://citeseer.ist.psu.edu/rd/12896645%2C198082%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/8004/http:zSzzSzwww.ca.sandia.govzSz%7EtgkoldazSzpaperszSzornl-tm-13756.pdf/new-term-weighting-formulas.pdf) to get a better idea of what kinds of vector space scoring strategies exist, so that I can compare the Lucene scoring with the rest of the world. My aim is basically to understand the strategic decisions that were made in Lucene (version 1.2). I have 3 questions:
1) tf_q and tf_d, i.e. all the term frequencies (TF) in the formula, are square roots in order to dampen the bias from large term frequencies. Looking through a number of IR papers, it seems that the "normal" way of dampening TF is the logarithm. What is the motivation for choosing the square root instead? Is there a simple mathematical reason, or is there any empirical evidence that this is the better strategy? Are there any papers that argue for this decision (perhaps with empirical data or otherwise)?
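To make the difference in question 1 concrete, here is a small Python sketch comparing the two dampening functions. The log variant 1 + log(freq) is only one common formulation that I picked for comparison; it is my assumption and not something taken from Lucene.

```python
import math


def tf_sqrt(freq):
    # Square-root dampening, as in the formula above
    return math.sqrt(freq) if freq > 0 else 0.0


def tf_log(freq):
    # One common log dampening (assumed here for comparison): 1 + ln(freq)
    return 1.0 + math.log(freq) if freq > 0 else 0.0


# Print raw frequency, sqrt-dampened and log-dampened values side by side
for freq in (1, 2, 4, 16, 100):
    print(freq, round(tf_sqrt(freq), 3), round(tf_log(freq), 3))
```

Both functions agree at a frequency of 1 and both flatten large counts, but the square root keeps growing faster for large frequencies, so very frequent terms retain more influence than they would under the log variant.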
2) In Lucene's scoring algorithm, the query part is normalised with norm_q, which is sqrt(sum_t((tf_q*idf_t)^2)). In the standard IR literature, this is referred to as Cosine Normalisation. The SMART system used this normalisation strategy, but only for the documents, not for the query. Queries were not normalised at all. The document terms in Lucene, on the other hand, are only normalised with norm_d_t, which is the square root of the number of tokens in d (which are also terms in my case) in the same field as t. On this I have two sub-questions:
2a) Why does Lucene apply Cosine Normalisation to the query? In a range of different IR system variations (as shown in Chisholm and Kolda's paper in the table on page 8) queries were not normalised at all. Is there a good reason or perhaps any empirical evidence that would support this decision?
2b) What is the motivation for the normalisation imposed on the documents (norm_d_t), which I have not seen before in any other system? Again, does anybody have pointers to literature on this?
3) What is the motivation for the additional normalisation by coord_q_d on top of what is already described above? Again, is there any literature that argues for this normalisation?
Answers to these questions would greatly help me to link this scoring formula with other IR strategies. This would help me to appreciate the value of this great IR library even more.
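One observation that may be related to 2b, offered only as a sketch and not as a statement about Lucene's design intent: if the document term weights are taken to be sqrt(freq) with no idf component, then the cosine norm of the document vector is sqrt(sum of freq), which is exactly the square root of the token count, i.e. norm_d_t. The helper names below are my own.

```python
import math
from collections import Counter


def cosine_norm_sqrt_tf(doc_terms):
    # Cosine norm of a document vector whose weights are sqrt(freq), no idf
    counts = Counter(doc_terms)
    return math.sqrt(sum(math.sqrt(c) ** 2 for c in counts.values()))


def length_norm(doc_terms):
    # norm_d_t from the formula above: sqrt(number of tokens in the field)
    return math.sqrt(len(doc_terms))


doc = ["lucene", "scoring", "lucene", "formula", "lucene"]
# The two values agree up to floating-point rounding
print(cosine_norm_sqrt_tf(doc), length_norm(doc))
```

If this reading is right, norm_d_t could be seen as a cosine normalisation that ignores idf, which would have the practical advantage of being computable from the field length alone.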
Any answer or partial answer on any of the questions would be greatly appreciated!
Best Regards and thanks in advance!
Karl