Hello Lucene users,
in the past, I asked a number of times about the scoring formula used in Lucene 1.2 (which
might still be valid in current Lucene versions). At that time I was asking purely out of
curiosity, but now I need it in order to write proper documentation.
Back then, I found a high-level answer with the kind help of Joaquin Delgado, who in his
posting ( http://mailarchives.apache.org/mod_mbox/lucenejavadev/200609.mbox/%3C45043CC5.3080103@oracle.com%3E
) pointed me to this mailing list contribution ( http://mailarchives.apache.org/mod_mbox/lucenejavadev/200307.mbox/%3C000501c34ced$1f3b5c90$0500a8c0@ki%3E
).
According to these sources, the Lucene scoring formula in version 1.2 is:
score(q,d) = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
boost_t) * coord_q_d
where
* score (q,d) : score for document d given query q
* sum_t : sum for all terms t in q
* tf_q : the square root of the frequency of t in q
* tf_d : the square root of the frequency of t in d
* idf_t : log(numDocs/(docFreq_t+1)) + 1.0
* numDocs : number of documents in index
* docFreq_t : number of documents containing t
* norm_q : sqrt(sum_t((tf_q*idf_t)^2))
* norm_d_t : square root of number of tokens in d in the same field
as t
* boost_t : the user-specified boost for term t
* coord_q_d : number of terms in both query and document / number of
terms in query. The coordination factor gives an AND-like boost to
documents that contain, e.g., all three terms in a three-word
query over those that contain just two of the words.
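To check that I am reading the formula correctly, here is a minimal Python sketch of it. This is only my own transcription under some assumptions, not Lucene's actual code: the function names (idf, score) and the data layout are made up for illustration, documents have a single field, idf_t is read as log(numDocs/(docFreq_t+1)) + 1.0, and boosts default to 1.0.

```python
import math
from collections import Counter

def idf(num_docs, doc_freq):
    # idf_t = log(numDocs / (docFreq_t + 1)) + 1.0  (my reading of the formula)
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def score(query_terms, doc_terms, doc_freqs, num_docs, boosts=None):
    """Score one single-field document against a query (illustrative only)."""
    boosts = boosts or {}
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_terms)
    if not q_counts or not d_counts:
        return 0.0

    # norm_q = sqrt(sum_t((tf_q * idf_t)^2)), with tf_q = sqrt(frequency of t in q)
    norm_q = math.sqrt(sum(
        (math.sqrt(freq) * idf(num_docs, doc_freqs.get(t, 0))) ** 2
        for t, freq in q_counts.items()))
    # norm_d_t = sqrt(number of tokens in the field); one field assumed here
    norm_d = math.sqrt(len(doc_terms))

    total = 0.0
    matched = 0
    for t, q_freq in q_counts.items():
        d_freq = d_counts.get(t, 0)
        if d_freq == 0:
            continue  # term not in the document: contributes nothing
        matched += 1
        idf_t = idf(num_docs, doc_freqs.get(t, 0))
        tf_q = math.sqrt(q_freq)
        tf_d = math.sqrt(d_freq)
        total += (tf_q * idf_t / norm_q) * (tf_d * idf_t / norm_d) * boosts.get(t, 1.0)

    # coord_q_d = terms in both query and document / terms in query
    coord = matched / len(q_counts)
    return total * coord
```

For example, score(["lucene", "scoring"], ["lucene", "scoring", "lucene", "rocks"], {"lucene": 1, "scoring": 1}, 2) scores a four-token document against a two-term query in a two-document index.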
This now allows me to include the scoring formula in my documentation, which will be
of great help. For verification, I have attached the formula as a picture generated from LaTeX.
Please let me know if you find any mistake or if you think the formula could be simplified
(I am not a mathematician...).
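In case the attachment does not come through, this is roughly how I would write the formula in LaTeX. It is my own transcription of the plain-text version above; in particular, putting docFreq_t + 1 in the denominator of idf_t is my reading of the source and may need correcting.

```latex
\[
  \mathrm{score}(q,d) = \mathrm{coord}_{q,d} \cdot \sum_{t \in q}
    \frac{\mathrm{tf}_{q} \cdot \mathrm{idf}_{t}}{\mathrm{norm}_{q}} \cdot
    \frac{\mathrm{tf}_{d} \cdot \mathrm{idf}_{t}}{\mathrm{norm}_{d,t}} \cdot
    \mathrm{boost}_{t}
\]
\[
  \mathrm{idf}_{t} = \log\!\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq}_{t} + 1}\right) + 1,
  \qquad
  \mathrm{norm}_{q} = \sqrt{\sum_{t \in q} \left(\mathrm{tf}_{q} \cdot \mathrm{idf}_{t}\right)^{2}}
\]
```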
For a deeper understanding, I would like to ask a few further questions. I am not an expert
in Information Retrieval, so I hope my questions are not embarrassingly basic. I read
the paper by Erica Chisholm and Tamara G. Kolda (http://citeseer.ist.psu.edu/rd/12896645%2C198082%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/8004/http:zSzzSzwww.ca.sandia.govzSz%7EtgkoldazSzpaperszSzornltm13756.pdf/newtermweightingformulas.pdf)
to get a better idea of what kinds of vector space scoring strategies exist, so that I can
compare the Lucene scoring with the rest of the world. My aim is basically to understand the
strategic decisions that were made in Lucene (version 1.2). I have 3 questions:
1) tf_q and tf_d, i.e. all the term frequencies (TF) in the formula, are square roots
in order to dampen the bias from large term frequencies. Looking through a number of IR
papers, it seems that the "normal" way of dampening TF is the logarithm. What is the motivation
for choosing the square root instead? Is there a simple mathematical reason, or is there any
empirical evidence that this is the better strategy? Are there any papers that argue for this
decision (perhaps with empirical data or otherwise)?
2) In Lucene's scoring algorithm, the query part is normalised with norm_q, which is sqrt(sum_t((tf_q*idf_t)^2)).
In standard IR literature, this is referred to as Cosine Normalisation. The SMART system used
this normalisation strategy, but only for the documents, not for the query. Queries were
not normalised at all. The document terms in Lucene, on the other hand, are only normalised
with norm_d_t, which is the square root of the number of tokens in d (which are also terms
in my case) in the same field as t. On this I have two sub-questions:
2a) Why does Lucene apply Cosine Normalisation to the query? In a range of different
IR system variations (as shown in Chisholm and Kolda's paper in the table on page 8) queries
were not normalised at all. Is there a good reason or perhaps any empirical evidence that
would support this decision?
2b) What is the motivation for the normalisation imposed on the documents (norm_d_t), which
I have not seen before in any other system? Again, does anybody have pointers to literature
on this?
3) What is the motivation for the additional normalisation by coord_q_d, on top of what is
already described above? Again, is there any literature that argues for this normalisation?
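To make questions 1 and 3 a bit more concrete, here is a small Python sketch of the quantities I am comparing. The function names are my own, and the logarithmic variant (1 + ln(freq)) is only one of several forms I have seen in the literature, so please treat it as an assumption rather than as the definitive alternative.

```python
import math

def tf_sqrt(freq):
    # Square-root damping, as used for tf_q and tf_d in the formula above.
    return math.sqrt(freq)

def tf_log(freq):
    # Assumption: one common logarithmic variant, 1 + ln(freq) for freq > 0.
    return 1.0 + math.log(freq) if freq > 0 else 0.0

def coord(matched_terms, query_terms):
    # coord_q_d = terms in both query and document / terms in query
    return matched_terms / query_terms

if __name__ == "__main__":
    # Both dampings agree at freq = 1; the square root grows faster afterwards.
    for freq in (1, 4, 16, 100):
        print(freq, tf_sqrt(freq), tf_log(freq))
    # A document matching 2 of 3 query terms versus one matching all 3.
    print(coord(2, 3), coord(3, 3))
```

Both dampings give 1 for a term that occurs once, but the square root penalises high frequencies less strongly than this logarithmic form, which is exactly the design choice I would like to understand.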
Answers to these questions would greatly help me to link this scoring formula with other
IR strategies, and to appreciate the value of this great IR library even more.
Any answer, or partial answer, to any of the questions would be greatly appreciated!
Best regards and thanks in advance!
Karl
