Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Message-ID: <47B2BCEE.7010601@focuseek.com>
Date: Wed, 13 Feb 2008 10:48:30 +0100
From: Michele Bini <michele@focuseek.com>
User-Agent: Thunderbird 2.0.0.9 (Macintosh/20071031)
MIME-Version: 1.0
To: java-dev@lucene.apache.org
Subject: Re: Usefulness of Similarity.queryNorm()
References: <9396E8E7-46FF-4B78-9427-13E9A7E584E4@rectangular.com>
 <129BA615-E1DA-4F31-BFEB-A591340E1285@rectangular.com>
 <7C6FD30D-DF38-4C4C-B776-F3B2F0AAA83F@apache.org>
 <A2E5796D-2DF4-4BC9-81F8-F61AD35DDAAE@rectangular.com>
 <Pine.LNX.4.62.0802130003280.1742@radix.cryptio.net>
In-Reply-To: <Pine.LNX.4.62.0802130003280.1742@radix.cryptio.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Chris Hostetter wrote:
>>> The tf(), idf(), lengthNorm() and queryNorm() are directly from the 
>>> cosine measure, although lengthNorm()'s default implementation uses an
>>> approximation.

I have actually found normalized query scores quite useful, so I decided 
to leave my usual lurk mode :)

I integrated Lucene with Carrot2 (more specifically, Carrot2's Lingo 
clustering algorithm, which at its core is based on cosine products). To 
incrementally restrict a Lucene query to Carrot2 clusters, it is 
essential that the Lucene query scores are, more or less, what a cosine 
product would give.

If I remember correctly, I could post-process the scores into a cosine 
product using sumOfSquaredWeights(), just as Query.weight() does now, but 
my point is slightly different.
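
For reference, the post-processing I have in mind amounts to roughly the 
following. This is only a sketch in plain Java, not Lucene code: the 
class CosineSketch and the raw double[] weight vectors are hypothetical 
stand-ins for the per-term query and document weights, and it assumes 
the normalization factor is 1/sqrt(sumOfSquaredWeights), which is what 
the default queryNorm() computes.

```java
public class CosineSketch {

    // Sum of the squared term weights of one vector.
    static double sumOfSquaredWeights(double[] weights) {
        double sum = 0.0;
        for (double w : weights) {
            sum += w * w;
        }
        return sum;
    }

    // Assumed normalization: 1 / sqrt(sumOfSquaredWeights).
    static double norm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    // Unnormalized score: plain dot product over the shared term positions.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Cosine product: the raw score scaled by both normalization factors.
    static double cosine(double[] query, double[] doc) {
        double q = sumOfSquaredWeights(query);
        double d = sumOfSquaredWeights(doc);
        if (q == 0.0 || d == 0.0) {
            return 0.0; // an all-zero vector has no direction
        }
        return dot(query, doc) * norm(q) * norm(d);
    }

    public static void main(String[] args) {
        double[] query = {1.0, 2.0, 0.0};
        double[] doc = {2.0, 4.0, 0.0};
        // Parallel vectors, so the result should be very close to 1.
        System.out.println(cosine(query, doc));
    }
}
```

Without the query-side factor the scores still rank documents the same 
way for a single query, but they are no longer comparable to what a 
cosine-based algorithm computes on its own.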

As a library user, I think it is important that Lucene offers clear, 
simple hooks to tweak (and even completely change) the computed score.

In some cases you need to compute a completely different score, and you 
use a ValueSourceQuery. But sometimes you are "lucky" (read: I chose 
Lingo for that reason, among others) because Lucene and the clustering 
algorithm use [nearly] the same score, so you don't have to compute it 
again, which improves performance.


Just my two cents,
Michele
