Subject: Re: Need help in understanding output of searcher.explain() function
From: Soby Thomas
To: java-user@lucene.apache.org
Date: Sat, 7 Aug 2010 19:38:34 +0100

Thanks Jayendra, it was really helpful.

On Sat, Aug 7, 2010 at 6:07 PM, jayendra patil wrote:

> Trying to put up an explanation:
>
> 0.022172567 = (MATCH) product of:
>   0.07760398 = (MATCH) sum of:
>     0.02287053 = (MATCH) weight(payload:ces in 550), product of:
>       0.32539415 = queryWeight(payload:ces), product of:
>         2.2491398 = *idf*(docFreq=157, maxDocs=551)
>         0.14467494 = queryNorm
>       0.07028562 = (MATCH) fieldWeight(payload:ces in 550), product of:
>         1.0 = *tf*(termFreq(payload:ces)=1)
>         2.2491398 = *idf*(docFreq=157, maxDocs=551)
>         0.03125 = *fieldNorm*(field=payload, doc=550)
>     0.05473345 = (MATCH) weight(payload:deal in 550), product of:
>       0.23803486 = queryWeight(payload:deal), product of:
>         1.6453081 = *idf*(docFreq=288, maxDocs=551)
>         0.14467494 = *queryNorm*
>       0.2299388 = (MATCH) fieldWeight(payload:deal in 550), product of:
>         4.472136 = tf(termFreq(payload:deal)=20)
>         1.6453081 = idf(docFreq=288, maxDocs=551)
>         0.03125 = fieldNorm(field=payload, doc=550)
>   0.2857143 = coord(2/7)
>
> 1. tf = term frequency in document = measure of how often a term appears
>    in the document
>       Implementation: sqrt(freq)
>       Implication: the more frequently a term occurs in a document, the
>       greater its score
>       Rationale: documents which contain more of a term are generally more
>       relevant
> 2. idf = inverse document frequency = measure of how often the term
>    appears across the index
>       Implementation: log(numDocs/(docFreq+1)) + 1
>       Implication: the greater the occurrence of a term in different
>       documents, the lower its score
>       Rationale: common terms are less important than uncommon ones
> 3. coord = number of terms in the query that were found in the
>    document
>       Implementation: overlap / maxOverlap
>       Implication: of the terms in the query, a document that contains more
>       terms will have a higher score
>       Rationale: self-explanatory
> 4. fieldNorm
>    1. lengthNorm = measure of the importance of a term according to the
>       total number of terms in the field
>       1. Implementation: 1/sqrt(numTerms)
>       2. Implication: a term matched in a field with fewer terms has a
>          higher score
>       3. Rationale: a term in a field with fewer terms is more important
>          than one in a field with more
>    2. boost (index) = boost of the field at index time
>       1. If an index-time boost is specified, the fieldNorm value in the
>          score includes it.
>    3. boost (query) = boost of the field at query time (this one is
>       applied through queryWeight rather than stored in fieldNorm)
> 5. queryNorm = normalization factor so that queries can be compared
>    1. queryNorm is not related to the relevance of the document, but
>       rather tries to make scores between different queries comparable. It
>       is implemented as 1/sqrt(sumOfSquaredWeights)
>
> When you search for the query: *It is definitely a CES deal that will be
> over in Sep or Oct of this year.*
>
> 1. Lucene tries to match each word in the query in each field that you
>    have specified to be searched, e.g. payload in your case.
> 2. In your example, it found matches only on ces and deal, hence only those
>    two items are displayed.
> 3. The number of matches in the particular field also contributes to
>    0.2857143 = coord(*2*/7), i.e. 2 words out of 7.
> 4. *idf*(docFreq=157, maxDocs=551) specifies the rarity. docFreq is the
>    number of documents that have the word in the field, and maxDocs is the
>    total number of documents.
> 5. *tf*(termFreq(payload:ces)=1) specifies the number of times the term
>    occurs, e.g. 1 in this case.
> 6. The score for each field match is the product of the following:
>
>    0.02287053 = (MATCH) weight(payload:ces in 550), product of:
>
>    Query-time boost, idf and queryNorm
>
>    0.32539415 = queryWeight(payload:ces), product of:
>      *1 = boost (the boost in your case seems to be 1 and hence does not
>      appear in the output)*
>      2.2491398 = idf(docFreq=157, maxDocs=551)
>      0.14467494 = queryNorm
>
>    Term frequency, idf and field norm
>
>    0.07028562 = (MATCH) fieldWeight(payload:ces in 550), product of:
>      1.0 = *tf*(termFreq(payload:ces)=1)
>      2.2491398 = *idf*(docFreq=157, maxDocs=551)
>      0.03125 = *fieldNorm*(field=payload, doc=550)
>
> Regards,
> Jayendra
>
> On Sat, Aug 7, 2010 at 11:02 AM, Soby Thomas wrote:
>
> > Hello Guys,
> >
> > I am trying to understand how the Lucene score is calculated, so I'm
> > using the searcher.explain() function. But the output it gives is really
> > confusing for me. Below are the details of the query that I gave and the
> > output it gave me.
> >
> > Query: *It is definitely a CES deal that will be over in Sep or Oct of
> > this year.*
> >
> > *output*:
> > 0.022172567 = (MATCH) product of:
> >   0.07760398 = (MATCH) sum of:
> >     0.02287053 = (MATCH) weight(payload:ces in 550), product of:
> >       0.32539415 = queryWeight(payload:ces), product of:
> >         2.2491398 = idf(docFreq=157, maxDocs=551)
> >         0.14467494 = queryNorm
> >       0.07028562 = (MATCH) fieldWeight(payload:ces in 550), product of:
> >         1.0 = tf(termFreq(payload:ces)=1)
> >         2.2491398 = idf(docFreq=157, maxDocs=551)
> >         0.03125 = fieldNorm(field=payload, doc=550)
> >     0.05473345 = (MATCH) weight(payload:deal in 550), product of:
> >       0.23803486 = queryWeight(payload:deal), product of:
> >         1.6453081 = idf(docFreq=288, maxDocs=551)
> >         0.14467494 = queryNorm
> >       0.2299388 = (MATCH) fieldWeight(payload:deal in 550), product of:
> >         4.472136 = tf(termFreq(payload:deal)=20)
> >         1.6453081 = idf(docFreq=288, maxDocs=551)
> >         0.03125 = fieldNorm(field=payload, doc=550)
> >   0.2857143 = coord(2/7)
> >
> > So can someone please help me to understand the output, or suggest any
> > link that explains it? I would be grateful.
> >
> > Regards
> > Soby
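
For anyone who wants to check the arithmetic in the explain() output by hand, here is a minimal sketch in plain Python. It is not Lucene code and calls no Lucene API: the function names (tf, idf, term_weight) are made up for illustration, and the formulas are simply the ones listed in the explanation above (sqrt(freq), log(numDocs/(docFreq+1)) + 1 with a natural log, and overlap/maxOverlap for coord). The queryNorm and fieldNorm values are copied from the output rather than derived, and the boost is assumed to be 1.

```python
import math

def tf(freq):
    # Term frequency factor as described above: sqrt(freq).
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # Inverse document frequency as described above, using the natural log.
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def term_weight(freq, doc_freq, num_docs, query_norm, field_norm, boost=1.0):
    # weight = queryWeight * fieldWeight, mirroring the two "product of" blocks.
    term_idf = idf(doc_freq, num_docs)
    query_weight = boost * term_idf * query_norm        # boost assumed to be 1
    field_weight = tf(freq) * term_idf * field_norm
    return query_weight * field_weight

# Values copied from the explain() output (not computed here).
QUERY_NORM = 0.14467494
FIELD_NORM = 0.03125
NUM_DOCS = 551

ces = term_weight(freq=1, doc_freq=157, num_docs=NUM_DOCS,
                  query_norm=QUERY_NORM, field_norm=FIELD_NORM)
deal = term_weight(freq=20, doc_freq=288, num_docs=NUM_DOCS,
                   query_norm=QUERY_NORM, field_norm=FIELD_NORM)

coord = 2 / 7                    # 2 of the 7 query terms matched
score = (ces + deal) * coord     # "product of" the sum and coord

print("weight(payload:ces)  =", ces)
print("weight(payload:deal) =", deal)
print("score                =", score)
```

The printed values should agree with the explain() output to roughly six decimal places; small differences in the last digits are expected because Lucene computes these in single-precision floats while Python uses doubles.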