From: Michele Bini
Date: Wed, 13 Feb 2008 10:48:30 +0100
To: java-dev@lucene.apache.org
Subject: Re: Usefulness of Similarity.queryNorm()

Chris Hostetter wrote:
>>> The tf(), idf(), lengthNorm() and queryNorm() are directly from the
>>> cosine measure, although lengthNorm()'s default implementation uses
>>> an approximation.

Since I have actually found normalized query scores quite useful, I
decided to leave my usual lurk mode :)

I integrated Lucene with Carrot2 (more specifically, Carrot2's Lingo
clustering algorithm, which at its core is based on cosine products).
To incrementally restrict a Lucene query to Carrot2 clusters, it is
essential that the Lucene query scores are, more or less, what a cosine
product would give.

From memory, I think I could post-process the scores into a cosine
product using sumOfSquaredWeights(), just as Query.weight() does now,
but my point is slightly different.

From a library user's point of view, I think it is important that
Lucene offers clear, simple hooks to tweak (and even completely change)
the computed score. In some cases you need to compute a completely
different score, and you use a ValueSourceQuery. But sometimes you are
"lucky" (read: I chose Lingo for that reason, among others): Lucene and
the clustering algorithm use [nearly] the same score, so you don't have
to compute it again, which improves performance.

Just my two cents,
Michele
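
For readers who want the post-processing idea above made concrete, here
is a minimal plain-Java sketch. It is not Lucene code and calls no
Lucene API: the class CosineSketch and its helper methods are
hypothetical names, and the weight vectors are made-up numbers. The
only thing borrowed is the shape of the normalization factor,
1 / sqrt(sumOfSquaredWeights), which is (as far as I recall) what the
default queryNorm() computes. Multiplying a raw dot product by that
factor for both the query and the document vector gives a cosine.

```java
public class CosineSketch {

    // Normalization factor: 1 / sqrt(sum of squared weights).
    // Assumption: this mirrors the default queryNorm() formula.
    // A zero vector is not handled here (the factor becomes infinite).
    public static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Sum of squared term weights of one vector.
    public static float sumOfSquares(float[] weights) {
        float sum = 0f;
        for (float w : weights) {
            sum += w * w;
        }
        return sum;
    }

    // Raw (unnormalized) score: dot product of two weight vectors.
    public static float dot(float[] q, float[] d) {
        if (q.length != d.length) {
            throw new IllegalArgumentException("vectors differ in length");
        }
        float sum = 0f;
        for (int i = 0; i < q.length; i++) {
            sum += q[i] * d[i];
        }
        return sum;
    }

    // Cosine: raw score rescaled by both normalization factors.
    public static float cosine(float[] q, float[] d) {
        return dot(q, d)
                * queryNorm(sumOfSquares(q))
                * queryNorm(sumOfSquares(d));
    }

    public static void main(String[] args) {
        float[] query = {3f, 4f};   // illustrative weights only
        float[] doc = {4f, 3f};
        System.out.println("raw score: " + dot(query, doc));
        System.out.println("cosine:    " + cosine(query, doc));
    }
}
```

The point of the sketch is only that the rescaling is cheap once the
sum of squared weights is known; whether real Lucene scores can be
recovered this way depends on the other factors (length norm
approximation, coord, boosts) that enter the score.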