lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joaquin Delgado <joaquin.delg...@oracle.com>
Subject Re: Lucene 1.2 - scoring forumla needed
Date Sun, 10 Sep 2006 16:26:45 GMT
What do you mean by mathematically correct? Is there something incorrect 
in the book?

According to a message posted some time ago at 
http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200307.mbox/%3C000501c34ced$1f3b5c90$0500a8c0@ki%3E

, where people first noticed a change in the scoring algorithm, the 
official FAQ (for 1.2) had posted, from Doug himself the following formula:

score(q,d) = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * 
boost_t) * coord_q_d

where

    * score (q,d) : score for document d given query q
    * sum_t : sum for all terms t in q
    * tf_q : the square root of the frequency of t in q
    * tf_d : the square root of the frequency of t in d
    * idf_t : log(numDocs/docFreq_t+1) + 1.0
    * numDocs : number of documents in index
    * docFreq_t : number of documents containing t
    * norm_q : sqrt(sum_t((tf_q*idf_t)^2))
    * norm_d_t : square root of number of tokens in d in the same field
      as t
    * boost_t : the user-specified boost for term t
    * coord_q_d : number of terms in both query and document / number of
      terms in query The coordination factor gives an AND-like boost to
      documents that contain, e.g., all three terms in a three word
      query over those that contain just two of the words.

This is diffirent that the current scoring algorithm described at 
http://lucene.apache.org/java/docs/scoring.html#Scoring which includes 
field boosting, document length normalization, etc.

In any case these are variations of the TF-IDF weighted vector space 
"cosine of the angle" between the document and the query vectors  (also 
known as cosine distance or normalized dot product - see 
http://en.wikipedia.org/wiki/Dot_product). This computation treats 
documents and queries as vectors in an N-dimensional space (N is the 
number of unique terms excluding stopwords).

In statistics/probabilistc terms this can also be interpretated as a 
geometrical interpretation of correlation  between samples drawn from 
two random variables Q and D (representing a query and a document -see 
http://en.wikipedia.org/wiki/Correlation) whereas each data point 
(TF-IDF weight)  is an estimation of how much "information" each term 
conveys. There are more complex probabilistc rankings algorithms which 
take advantage of previous knowledge of relevance (pre-ranked documents 
for example) in its computation primarily exploiting bayes theorem.

Both Vector Space Model and Probabilistic Model are well studied in 
Information Retrieval Literature. See 
http://www2.sims.berkeley.edu/courses/is202/f00/lectures/Lecture8_202.ppt 
for an overview of Ranking and Feedback.



-- Joaquin Delgado




Karl Koch wrote:

>Hi,
>
>I am looking for a mathematically correct IR scoring formula for Lucene 1.2. The description
in the book (Lucene in Action, 2005 edition) is rather non-mathematical, also I am not sure
if this is the one that also counts for Lucene 1.2 and not for later versions.
>
>Perhaps Eric or Otis can directy comment on this? Is there any paper on the Lucene scoring
algorithm that was published and describes the formula in depth?
>
>Best Regards,
>Karl
>  
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message