# lucene-dev mailing list archives

##### Site index · List index
Message view
Top
Subject Re: Lucene 1.2 - scoring forumla needed
Date Sun, 10 Sep 2006 16:26:45 GMT
```What do you mean by mathematically correct? Is there something incorrect
in the book?

According to a message posted some time ago at
http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200307.mbox/%3C000501c34ced\$1f3b5c90\$0500a8c0@ki%3E

, where people first noticed a change in the scoring algorithm, the
official FAQ (for 1.2) had posted, from Doug himself the following formula:

score(q,d) = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
boost_t) * coord_q_d

where

* score (q,d) : score for document d given query q
* sum_t : sum for all terms t in q
* tf_q : the square root of the frequency of t in q
* tf_d : the square root of the frequency of t in d
* idf_t : log(numDocs/docFreq_t+1) + 1.0
* numDocs : number of documents in index
* docFreq_t : number of documents containing t
* norm_q : sqrt(sum_t((tf_q*idf_t)^2))
* norm_d_t : square root of number of tokens in d in the same field
as t
* boost_t : the user-specified boost for term t
* coord_q_d : number of terms in both query and document / number of
terms in query The coordination factor gives an AND-like boost to
documents that contain, e.g., all three terms in a three word
query over those that contain just two of the words.

This is diffirent that the current scoring algorithm described at
http://lucene.apache.org/java/docs/scoring.html#Scoring which includes
field boosting, document length normalization, etc.

In any case these are variations of the TF-IDF weighted vector space
"cosine of the angle" between the document and the query vectors  (also
known as cosine distance or normalized dot product - see
http://en.wikipedia.org/wiki/Dot_product). This computation treats
documents and queries as vectors in an N-dimensional space (N is the
number of unique terms excluding stopwords).

In statistics/probabilistc terms this can also be interpretated as a
geometrical interpretation of correlation  between samples drawn from
two random variables Q and D (representing a query and a document -see
http://en.wikipedia.org/wiki/Correlation) whereas each data point
(TF-IDF weight)  is an estimation of how much "information" each term
conveys. There are more complex probabilistc rankings algorithms which
take advantage of previous knowledge of relevance (pre-ranked documents
for example) in its computation primarily exploiting bayes theorem.

Both Vector Space Model and Probabilistic Model are well studied in
Information Retrieval Literature. See
http://www2.sims.berkeley.edu/courses/is202/f00/lectures/Lecture8_202.ppt
for an overview of Ranking and Feedback.

Karl Koch wrote:

>Hi,
>
>I am looking for a mathematically correct IR scoring formula for Lucene 1.2. The description
in the book (Lucene in Action, 2005 edition) is rather non-mathematical, also I am not sure
if this is the one that also counts for Lucene 1.2 and not for later versions.
>
>Perhaps Eric or Otis can directy comment on this? Is there any paper on the Lucene scoring
algorithm that was published and describes the formula in depth?
>
>Best Regards,
>Karl
>
>

```
Mime
• Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message