lucene-java-user mailing list archives

From jayendra patil <jayendra.pa...@gmail.com>
Subject Re: Need help in understanding output of searcher.explain() function
Date Sat, 07 Aug 2010 17:07:40 GMT
Here is an attempt at an explanation:

0.022172567 = (MATCH) product of:
 0.07760398 = (MATCH) sum of:
   0.02287053 = (MATCH) weight(payload:ces in 550), product of:
     0.32539415 = queryWeight(payload:ces), product of:
       2.2491398 = idf(docFreq=157, maxDocs=551)
       0.14467494 = queryNorm
     0.07028562 = (MATCH) fieldWeight(payload:ces in 550), product of:
       1.0 = tf(termFreq(payload:ces)=1)
       2.2491398 = idf(docFreq=157, maxDocs=551)
       0.03125 = fieldNorm(field=payload, doc=550)
   0.05473345 = (MATCH) weight(payload:deal in 550), product of:
     0.23803486 = queryWeight(payload:deal), product of:
       1.6453081 = idf(docFreq=288, maxDocs=551)
       0.14467494 = queryNorm
     0.2299388 = (MATCH) fieldWeight(payload:deal in 550), product of:
       4.472136 = tf(termFreq(payload:deal)=20)
       1.6453081 = idf(docFreq=288, maxDocs=551)
       0.03125 = fieldNorm(field=payload, doc=550)
 0.2857143 = coord(2/7)


   1. tf = term frequency in document = measure of how often a term appears
      in the document
         Implementation: sqrt(freq)
         Implication: the more often a term occurs in a document, the
         greater its score
         Rationale: documents that contain more of a term are generally
         more relevant
   2. idf = inverse document frequency = measure of how often the term
      appears across the index
         Implementation: log(numDocs/(docFreq+1)) + 1
         Implication: the more documents a term occurs in, the lower its
         score
         Rationale: common terms are less important than uncommon ones
   3. coord = fraction of the terms in the query that were found in the
      document
         Implementation: overlap / maxOverlap
         Implication: of the terms in the query, a document that contains
         more of them will have a higher score
         Rationale: self-explanatory
   4. fieldNorm
      1. lengthNorm = measure of the importance of a term according to the
         total number of terms in the field
         1. Implementation: 1/sqrt(numTerms)
         2. Implication: a term matched in a field with fewer terms has a
            higher score
         3. Rationale: a term in a field with fewer terms is more important
            than one in a field with more
      2. boost (index) = boost of the field at index time
         1. If an index-time boost was specified, the fieldNorm value in
            the score includes it.
      3. boost (query) = boost of the field at query time; it is applied
         through queryWeight rather than stored in fieldNorm
   5. queryNorm = normalization factor so that queries can be compared
      1. queryNorm is not related to the relevance of the document, but
         rather tries to make scores between different queries comparable.
         It is implemented as 1/sqrt(sumOfSquaredWeights)
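
The formulas in the list above can be sketched in plain Java. This is only an
illustration, not Lucene's actual source: the class and method names are made
up, and the field length of 1024 is a hypothetical value picked because
1/sqrt(1024) = 0.03125. The real field length is not given in this thread, and
the stored fieldNorm also folds in any index-time boost.

```java
// Illustrative sketch of the scoring factors described above.
// Names are hypothetical; this does not call or reproduce Lucene code.
final class ScoringFactors {

    // tf: square root of the raw term frequency in the document.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf: log(numDocs / (docFreq + 1)) + 1, using the natural logarithm.
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // coord: fraction of the query terms that matched in the document.
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // lengthNorm: fields with fewer terms get a larger norm.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // queryNorm: scales query weights so different queries are comparable.
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        // Inputs taken from the explain output in this thread.
        System.out.println("tf(ces)   = " + tf(1));
        System.out.println("tf(deal)  = " + tf(20));
        System.out.println("idf(ces)  = " + idf(157, 551));
        System.out.println("idf(deal) = " + idf(288, 551));
        System.out.println("coord     = " + coord(2, 7));
        // 1024 is an assumed field length, see the note above.
        System.out.println("lengthNorm(1024) = " + lengthNorm(1024));
    }
}
```

Up to float rounding, the tf, idf and coord values should line up with the
corresponding lines of the explain output.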


When you search with the query "It is definitely a CES deal that will be
over in Sep or Oct of this year.":

1. Lucene tries to match each word of your query in each field that you
have specified to be searched, e.g. payload in your case.
2. In your example it found matches only for ces and deal, hence only those
two items are displayed.
3. The number of matched terms also determines the coord factor:
0.2857143 = coord(2/7), i.e. 2 words out of 7.
4. idf(docFreq=157, maxDocs=551) indicates the rarity of the term. docFreq
is the number of documents that have the word in the field, and maxDocs is
the total number of documents.
5. tf(termFreq(payload:ces)=1) is derived from the number of times the term
occurs in the document, e.g. 1 in this case.
6. The score for each term match is the product of queryWeight and
fieldWeight:

0.02287053 = (MATCH) weight(payload:ces in 550), product of:

     queryWeight: field boost, idf and queryNorm

     0.32539415 = queryWeight(payload:ces), product of:
       1 = boost (the boost in your case seems to be 1, and hence it is
           not shown in the output)
       2.2491398 = idf(docFreq=157, maxDocs=551)
       0.14467494 = queryNorm

     fieldWeight: term frequency, idf and field norm

     0.07028562 = (MATCH) fieldWeight(payload:ces in 550), product of:
       1.0 = tf(termFreq(payload:ces)=1)
       2.2491398 = idf(docFreq=157, maxDocs=551)
       0.03125 = fieldNorm(field=payload, doc=550)
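
The arithmetic can be checked by hand. The following is a small sketch in plain
Java that multiplies and sums the numbers from the explain output; the class
and method names are hypothetical and it does not use the Lucene API. It
assumes a query boost of 1, as in your case.

```java
// Recomputes the explain() tree from its leaf values.
// Hypothetical helper names; plain arithmetic only, no Lucene calls.
final class ExplainArithmetic {

    // weight(term) = queryWeight * fieldWeight, with a query boost of 1.
    static double termWeight(double tf, double idf, double fieldNorm, double queryNorm) {
        double queryWeight = idf * queryNorm;      // boost of 1 omitted
        double fieldWeight = tf * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // Final score = (sum of the matching term weights) * coord.
    static double score(double[] termWeights, int overlap, int maxOverlap) {
        double sum = 0.0;
        for (double w : termWeights) {
            sum += w;
        }
        return sum * ((double) overlap / maxOverlap);
    }

    public static void main(String[] args) {
        // Leaf values copied from the explain output for document 550.
        double queryNorm = 0.14467494;
        double fieldNorm = 0.03125;

        double ces = termWeight(1.0, 2.2491398, fieldNorm, queryNorm);
        double deal = termWeight(4.472136, 1.6453081, fieldNorm, queryNorm);
        double total = score(new double[] {ces, deal}, 2, 7);

        System.out.println("weight(payload:ces)  = " + ces);
        System.out.println("weight(payload:deal) = " + deal);
        System.out.println("score                = " + total);
    }
}
```

The three printed values should agree with 0.02287053, 0.05473345 and
0.022172567 from the explain output, apart from small rounding differences
because Lucene computes in float.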



Regards,
Jayendra

On Sat, Aug 7, 2010 at 11:02 AM, Soby Thomas <soby.thomas85@gmail.com> wrote:

> Hello Guys,
>
> I am trying to understand how the Lucene score is calculated, so I'm using
> the searcher.explain() function. But the output it gives is really
> confusing for me. Below are the details of the query that I gave and the
> output it gave me.
>
> Query: *It is definitely a CES deal that will be over in Sep or Oct of this
> year.*
>
> *output*:
>  0.022172567 = (MATCH) product of:
>  0.07760398 = (MATCH) sum of:
>    0.02287053 = (MATCH) weight(payload:ces in 550), product of:
>      0.32539415 = queryWeight(payload:ces), product of:
>        2.2491398 = idf(docFreq=157, maxDocs=551)
>        0.14467494 = queryNorm
>      0.07028562 = (MATCH) fieldWeight(payload:ces in 550), product of:
>        1.0 = tf(termFreq(payload:ces)=1)
>        2.2491398 = idf(docFreq=157, maxDocs=551)
>        0.03125 = fieldNorm(field=payload, doc=550)
>    0.05473345 = (MATCH) weight(payload:deal in 550), product of:
>      0.23803486 = queryWeight(payload:deal), product of:
>        1.6453081 = idf(docFreq=288, maxDocs=551)
>        0.14467494 = queryNorm
>      0.2299388 = (MATCH) fieldWeight(payload:deal in 550), product of:
>        4.472136 = tf(termFreq(payload:deal)=20)
>        1.6453081 = idf(docFreq=288, maxDocs=551)
>        0.03125 = fieldNorm(field=payload, doc=550)
>  0.2857143 = coord(2/7)
>
> So can someone please help me understand the output, or suggest a link
> that explains it? I would be grateful.
>
> Regards
> Soby
>
