What do you mean by mathematically correct? Is there something incorrect
in the book?
According to a message posted some time ago at
http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200307.mbox/%3C000501c34ced$1f3b5c90$0500a8c0@ki%3E
(where people first noticed a change in the scoring algorithm), the
official FAQ for 1.2 gave the following formula, from Doug himself:
score(q,d) = sum_t( (tf_q * idf_t / norm_q) * (tf_d * idf_t / norm_d_t)
* boost_t ) * coord_q_d
where
* score (q,d) : score for document d given query q
* sum_t : sum for all terms t in q
* tf_q : the square root of the frequency of t in q
* tf_d : the square root of the frequency of t in d
* idf_t : log(numDocs/(docFreq_t+1)) + 1.0
* numDocs : number of documents in index
* docFreq_t : number of documents containing t
* norm_q : sqrt(sum_t((tf_q*idf_t)^2))
* norm_d_t : square root of number of tokens in d in the same field
as t
* boost_t : the user-specified boost for term t
* coord_q_d : number of terms in both query and document / number of
terms in query

The coordination factor gives an AND-like boost to documents that
contain, e.g., all three terms in a three-word query over those that
contain just two of the words.
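Putting the definitions above together, here is a minimal sketch in Python (toy code for illustration, not Lucene itself; the dict-based inputs and helper names are my own invention):

```python
import math

def idf(num_docs, doc_freq):
    # idf_t = log(numDocs / (docFreq_t + 1)) + 1.0
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def score(query_tf, doc_tf, doc_freqs, num_docs, field_len, boosts=None):
    # query_tf / doc_tf: term -> raw frequency in the query / document field
    # doc_freqs: term -> number of documents containing the term
    # field_len: number of tokens in the document field (for norm_d_t)
    boosts = boosts or {}
    # norm_q = sqrt(sum_t((tf_q * idf_t)^2)) over all query terms
    norm_q = math.sqrt(sum((math.sqrt(f) * idf(num_docs, doc_freqs[t])) ** 2
                           for t, f in query_tf.items()))
    norm_d_t = math.sqrt(field_len)
    total, matched = 0.0, 0
    for t, qf in query_tf.items():
        if t not in doc_tf:
            continue
        matched += 1
        i = idf(num_docs, doc_freqs[t])
        tf_q = math.sqrt(qf)          # square root of frequency of t in q
        tf_d = math.sqrt(doc_tf[t])   # square root of frequency of t in d
        total += (tf_q * i / norm_q) * (tf_d * i / norm_d_t) * boosts.get(t, 1.0)
    coord = matched / len(query_tf)   # coord_q_d
    return total * coord
```

Note how coord penalizes a document matching one of two query terms by a factor of 0.5 relative to its summed term scores.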
This is different from the current scoring algorithm described at
http://lucene.apache.org/java/docs/scoring.html#Scoring which includes
field boosting, document length normalization, etc.
In any case these are variations of the TF-IDF weighted vector space
"cosine of the angle" between the document and the query vectors (also
known as cosine distance or normalized dot product - see
http://en.wikipedia.org/wiki/Dot_product). This computation treats
documents and queries as vectors in an N-dimensional space (N is the
number of unique terms excluding stopwords).
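The "cosine of the angle" boils down to a normalized dot product over sparse term-weight vectors. A quick sketch (representing the vectors as Python dicts is my own illustrative choice):

```python
import math

def cosine(u, v):
    # u, v: term -> TF-IDF weight; cosine = dot(u, v) / (|u| * |v|)
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Identical vectors score 1.0 and vectors sharing no terms score 0.0, regardless of vector length, which is exactly the angle-only behavior described above.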
In statistical/probabilistic terms, this can also be viewed as a
geometric interpretation of the correlation between samples drawn from
two random variables Q and D (representing a query and a document - see
http://en.wikipedia.org/wiki/Correlation), where each data point
(a TF-IDF weight) is an estimate of how much "information" each term
conveys. There are more complex probabilistic ranking algorithms that
take advantage of prior knowledge of relevance (pre-ranked documents,
for example) in their computation, primarily by exploiting Bayes' theorem.
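As one concrete example of such a Bayes-derived weight (this is from the IR literature, not from Lucene): the standard Robertson/Sparck Jones relevance weight scores a term using counts from documents already judged relevant.

```python
import math

def rsj_weight(N, n, R, r):
    # Robertson/Sparck Jones relevance weight for one term.
    # N: documents in the collection, n: documents containing the term,
    # R: known relevant documents, r: relevant documents containing the term.
    # The 0.5 terms smooth away zero counts.
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))
```

With no relevance information at all (R = r = 0) the weight degrades gracefully to an idf-like quantity, log((N - n + 0.5) / (n + 0.5)); as relevance judgments accumulate, terms concentrated in the relevant set get boosted.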
Both Vector Space Model and Probabilistic Model are well studied in
Information Retrieval Literature. See
http://www2.sims.berkeley.edu/courses/is202/f00/lectures/Lecture8_202.ppt
for an overview of Ranking and Feedback.
-- Joaquin Delgado
Karl Koch wrote:
>Hi,
>
>I am looking for a mathematically correct IR scoring formula for Lucene 1.2. The description in the book (Lucene in Action, 2005 edition) is rather non-mathematical, and I am not sure whether it is the one that applies to Lucene 1.2 rather than to later versions.
>
>Perhaps Eric or Otis can directly comment on this? Is there any paper on the Lucene scoring algorithm that was published and describes the formula in depth?
>
>Best Regards,
>Karl
>
>