From: Soeren Pekrul
Date: Tue, 12 Dec 2006 11:52:58 +0100
To: java-user@lucene.apache.org
Subject: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

Hello Karl,

I’m very interested in the details of Lucene’s scoring as well.

Karl Koch wrote:
> For this reason, I do not understand why Lucene (in version 1.2) normalises the query(!) with
>
> norm_q : sqrt(sum_t((tf_q*idf_t)^2))
>
> which is also called cosine normalisation. This is a technique that is rather comprehensive and usually used for documents only(!) in all systems I have seen so far.

I hope I have understood http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm and your problem correctly: "queryNorm(q) is a normalizing factor used to make scores between queries comparable."

For "normal" searches you don’t need to compare queries. You only have to compare the documents returned for a single query. Queries in a "normal" search usually have different semantics, so you can’t really compare the results of different queries.

If you use Lucene, for instance, to classify documents, it is necessary to compare the results of different queries. You have the documents to classify indexed on one side and the classes on the other side (thread "Store a document-like map" http://www.gossamer-threads.com/lists/lucene/java-user/42816). Then you can generate queries from the classes and search against the documents. The score of a matching document is the similarity of the document to the query built from the class. Now the queries have to be comparable.

You can transform a document into a query and a query into a document. That could be the reason for normalizing a query like a document.

> For the documents Lucene employs its norm_d_t which is explained as:
>
> norm_d_t : square root of number of tokens in d in the same field as t
>
> basically just the square root of the number of unique terms in the document (since I do search over all fields always). I would have expected cosine normalisation here...
>
> The paper you provided uses document normalisation in the following way:
>
> norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))
>
> I am not sure how this relates to norm_d_t.

"norm(t,d) = doc.getBoost() • lengthNorm(field) • ∏ f.getBoost()" (http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_norm)

That seems to be independent of the document’s length. The factor lengthNorm(field) uses the document’s length, or rather the field length: "Computes the normalization value for a field given the total number of terms contained in a field." (http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_norm). "Implemented as 1/sqrt(numTerms)" (http://lucene.apache.org/java/docs/api/org/apache/lucene/search/DefaultSimilarity.html#lengthNorm(java.lang.String,%20int))

Sören
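
As a rough illustration of the two normalisations discussed in this message, here is a minimal Java sketch. It does not call the Lucene API: the class NormSketch and its method names are hypothetical, and the tf/idf values are made up. It assumes the query norm is the vector length from Karl's formula, sqrt(sum_t((tf_q*idf_t)^2)), and that the length norm is the 1/sqrt(numTerms) quoted from the DefaultSimilarity documentation.

```java
final class NormSketch {

    // Hypothetical helper: length of the query vector as in Karl's formula,
    // norm_q = sqrt(sum_t((tf_q * idf_t)^2)).
    public static double queryVectorLength(double[] tfQ, double[] idf) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < tfQ.length; i++) {
            double weight = tfQ[i] * idf[i];
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Hypothetical helper: query term weights divided by the query vector length.
    public static double[] normalisedWeights(double[] tfQ, double[] idf) {
        double normQ = queryVectorLength(tfQ, idf);
        double[] result = new double[tfQ.length];
        for (int i = 0; i < tfQ.length; i++) {
            result[i] = tfQ[i] * idf[i] / normQ;
        }
        return result;
    }

    // Length normalisation as quoted in the message: 1/sqrt(numTerms),
    // where numTerms is the number of terms in the field.
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        double[] idf = {3.0, 4.0};   // made-up idf values for two terms
        double[] q1 = {1.0, 1.0};    // query with each term once
        double[] q2 = {2.0, 2.0};    // same terms, each twice

        System.out.println("norm_q(q1) = " + queryVectorLength(q1, idf));
        System.out.println("norm_q(q2) = " + queryVectorLength(q2, idf));

        // After dividing by norm_q both queries have the same weights,
        // which is what makes their scores comparable.
        System.out.println("q1 first weight = " + normalisedWeights(q1, idf)[0]);
        System.out.println("q2 first weight = " + normalisedWeights(q2, idf)[0]);

        System.out.println("lengthNorm(4)   = " + lengthNorm(4));
        System.out.println("lengthNorm(100) = " + lengthNorm(100));
    }
}
```

In this sketch the two queries differ only by a constant factor, so their normalised weights coincide; the length norm shrinks as the field gets longer, so it does depend on the field length.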