From: Christoph Goller
Date: Wed, 13 Oct 2004 11:03:35 +0200
To: Lucene Developers List
Subject: Search and Scoring
Message-ID: <416CEF67.8010500@detego-software.de>

> As an aside, is there a reason that idf is squared in each Term and
> Phrase match (it is multiplied both into the query component and the
> field component)? To compensate for this, I'm taking the square root of
> the idf I really want in my Similarity, which seems strange.

Hi Chuck,

that's a very good question. You are right that it may be a bug, though I am
not sure. I have stumbled over this several times when studying the code in
the search package. It is a little difficult to explain, since the code for
score computation is distributed over the Weight and Scorer classes. It seems
that the weight of a TermQuery or a PhraseQuery is multiplied by idf twice,
first in sumOfSquaredWeights() and then in normalize(). That's what you
discovered.

The formula in the Similarity Javadoc does not describe the scoring
completely. I will try to write down the formula that exactly describes the
current implementation. Then we can start a discussion and people can decide
whether this is the intended scoring. (I assume DefaultSimilarity here.)

Let's start with the simple case. A pure TermQuery (a one-word query) gets the
following score after cancelling queryNorm(t) and queryBoost(t) (coord is 1
here):

t: TermQuery
d: document

score(t, d) = tf(t in d) * idf(t)
              * fieldBoost(t.field in d)
              * lengthFieldNorm(t.field in d)

Note that fieldBoost and lengthFieldNorm are both combined in the norms.
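
To make the cancellation for the single-term case concrete, here is a minimal
numeric sketch in Python. It is not Lucene code. It assumes the raw term
weight is idf(t) * queryBoost(t), that queryNorm is the reciprocal of the
square root of the summed squared weights, and that idf is applied a second
time during normalization. All function and variable names are hypothetical,
and the input numbers are made up.

```python
import math

def single_term_score(tf, idf, query_boost, field_boost, length_field_norm):
    """Score of a one-term query, computed the long way (assumed steps)."""
    # First idf factor: the raw query weight that feeds sumOfSquaredWeights().
    query_weight = idf * query_boost
    sum_of_squared_weights = query_weight ** 2
    # queryNorm over the only term in the query.
    query_norm = 1.0 / math.sqrt(sum_of_squared_weights)
    # Second idf factor: applied when the weight is normalized.
    value = query_weight * query_norm * idf
    return tf * value * field_boost * length_field_norm

def simplified_score(tf, idf, field_boost, length_field_norm):
    """The cancelled-down formula: queryNorm and queryBoost have dropped out."""
    return tf * idf * field_boost * length_field_norm

# Made-up inputs; the query boost is deliberately not 1 to show it cancels.
long_way = single_term_score(tf=2.0, idf=3.0, query_boost=5.0,
                             field_boost=1.5, length_field_norm=0.25)
short_way = simplified_score(tf=2.0, idf=3.0,
                             field_boost=1.5, length_field_norm=0.25)
print(long_way, short_way)
```

Both values agree up to floating-point rounding: with only one term in the
query, queryNorm removes one idf factor together with the query boost.
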
For a BooleanQuery consisting of several TermQueries we get the following
(again, queryBoost(q) cancels):

q: BooleanQuery
t: Term and corresponding TermQuery
d: document

score(q, d) = coord(q, d) * queryNorm(q)
              * SUM_{t in q} ( tf(t in d) * idf(t)^2 * queryBoost(t)
                               * fieldBoost(t.field in d)
                               * lengthFieldNorm(t.field in d) )

where

coord(q, d)  = "fraction of TermQueries occurring in d"
queryNorm(q) = 1 / SQRT( SUM_{t in q} ( (idf(t) * queryBoost(t))^2 ) )

I hope this starts a discussion.

Christoph
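
P.S. As a quick numeric check of the BooleanQuery formula, here is a minimal
Python sketch. It is not Lucene code: the TermStats class, the function names
and the input numbers are hypothetical, and the factors tf, idf, boosts and
norms are passed in directly rather than computed by a Similarity. A term
that does not occur in the document is modelled with tf = 0.

```python
import math
from dataclasses import dataclass

@dataclass
class TermStats:
    tf: float                 # tf(t in d); 0.0 when the term is absent from d
    idf: float                # idf(t)
    query_boost: float        # queryBoost(t)
    field_boost: float        # fieldBoost(t.field in d)
    length_field_norm: float  # lengthFieldNorm(t.field in d)

def query_norm(terms):
    # 1 / SQRT( SUM_{t in q} (idf(t) * queryBoost(t))^2 )
    return 1.0 / math.sqrt(sum((t.idf * t.query_boost) ** 2 for t in terms))

def coord(terms):
    # Fraction of the query's terms that occur in the document.
    matched = sum(1 for t in terms if t.tf > 0)
    return matched / len(terms)

def boolean_score(terms):
    # Note the idf^2 inside the sum: this is the squared idf under discussion.
    total = sum(
        t.tf * t.idf ** 2 * t.query_boost * t.field_boost * t.length_field_norm
        for t in terms
    )
    return coord(terms) * query_norm(terms) * total

# Two-term query; only the first term occurs in the document.
terms = [
    TermStats(tf=1.0, idf=3.0, query_boost=1.0, field_boost=1.0, length_field_norm=0.5),
    TermStats(tf=0.0, idf=4.0, query_boost=1.0, field_boost=1.0, length_field_norm=0.5),
]
score = boolean_score(terms)
print(score)
```

With these inputs coord is 0.5, queryNorm is 1 / SQRT(9 + 16) = 0.2 and the
sum is 4.5, so the score is about 0.45. For a one-term list the same function
reduces to tf * idf * fieldBoost * lengthFieldNorm, because queryNorm removes
one idf factor and the query boost.
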