Message-ID: <457E8A0A.5080708@gmx.de>
Date: Tue, 12 Dec 2006 11:52:58 +0100
From: Soeren Pekrul <soeren.pekrul@gmx.de>
To: java-user@lucene.apache.org
Subject: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula
 needed)
References: 
 <OFAFD62F89.4391B508-ON88257242.00218077-88257242.0024CC6E@il.ibm.com>
 <20061212085928.212210@gmx.net>
In-Reply-To: <20061212085928.212210@gmx.net>

Hello Karl,

I’m very interested in the details of Lucene’s scoring as well.

Karl Koch wrote:
> For this reason, I do not understand why Lucene (in version 1.2) normalises the query(!) with 
> 
> norm_q : sqrt(sum_t((tf_q*idf_t)^2))
> 
> which is also called cosine normalisation. This is a technique that is rather comprehensive and usually used for documents only(!) in all systems I have seen so far.

I hope I have understood 
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm 
and your problem correctly: "queryNorm(q) is a normalizing factor used 
to make scores between queries comparable."
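To make the formula you quoted concrete, here is a minimal sketch in 
plain Java. It is not Lucene's code: the class and method names are my 
own, and it simply implements norm_q = sqrt(sum_t((tf_q*idf_t)^2)) as 
written above and divides each query term weight by it.

```java
// Hypothetical illustration, not part of Lucene.
final class QueryNormSketch {

    /** Euclidean length of the query weight vector: sqrt(sum_t (tf_q * idf_t)^2). */
    static double norm(double[] tf, double[] idf) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < tf.length; i++) {
            double weight = tf[i] * idf[i];
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Query term weights divided by the norm, so the vector has length 1. */
    static double[] normalizedWeights(double[] tf, double[] idf) {
        double n = norm(tf, idf);
        double[] out = new double[tf.length];
        for (int i = 0; i < tf.length; i++) {
            // An all-zero query has no direction; leave its weights at 0.
            out[i] = (n == 0.0) ? 0.0 : (tf[i] * idf[i]) / n;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] idf = {3.0, 4.0};
        // The second query repeats every term twice; after normalization
        // both queries carry the same weights.
        double[] q1 = normalizedWeights(new double[] {1.0, 1.0}, idf);
        double[] q2 = normalizedWeights(new double[] {2.0, 2.0}, idf);
        System.out.println(q1[0] + " " + q1[1]);
        System.out.println(q2[0] + " " + q2[1]);
    }
}
```

Both queries in main end up with the same normalized weights, which is 
the sense in which the factor makes scores of different queries 
comparable. It does not change the ranking of documents within one query, 
because every document score is multiplied by the same constant.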

For "normal" searches you don’t need to compare queries. You only have 
to compare the documents matched by a single query. Queries in a "normal" 
search usually have different semantics, so you can’t really compare the 
results of different queries anyway.

If you use Lucene for, say, classification of documents, it is 
necessary to compare the results of different queries. You have the 
documents to classify indexed on one side and the classes on the other 
side (thread "Store a document-like map" 
http://www.gossamer-threads.com/lists/lucene/java-user/42816). Then you 
can generate queries from the classes and search against the documents. 
The score of a matching document is the similarity of the document to 
the query built from the class. Now the queries have to be comparable.

You can transform a document into a query and a query into a document. 
That could be the reason for normalizing a query like a document.

> For the documents Lucene employs its norm_d_t which is explained as:
> 
> norm_d_t : square root of number of tokens in d in the same field as t
> 
> basically just the square root of the number of unique terms in the document (since I do search over all fields always). I would have expected cosine normalisation here... 
> 
> The paper you provided uses document normalisation in the following way:
> 
> norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))
> 
> I am not sure how this relates to norm_d_t.

"norm(t,d)   =   doc.getBoost()  •  lengthNorm(field)  •  ∏ f.getBoost()"
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_norm)

At first glance that seems to be independent of the document’s length. 
But the factor lengthNorm(field) uses the document’s length, or rather 
the field’s length: 
"Computes the normalization value for a field given the total number of 
terms contained in a field." 
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_norm).

"Implemented as 1/sqrt(numTerms)" 
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/DefaultSimilarity.html#lengthNorm(java.lang.String,%20int))
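A small sketch of how the two quoted formulas fit together, again in 
plain Java rather than Lucene's own classes. The names are hypothetical, 
the boosts are passed in as plain numbers, and numTerms is assumed to be 
greater than zero.

```java
// Hypothetical illustration, not part of Lucene.
final class LengthNormSketch {

    /** 1/sqrt(numTerms), as quoted from the DefaultSimilarity docs. Assumes numTerms > 0. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** norm(t,d) = docBoost * lengthNorm(field) * product of the field boosts. */
    static double norm(double docBoost, int numTerms, double... fieldBoosts) {
        double result = docBoost * lengthNorm(numTerms);
        for (double boost : fieldBoosts) {
            result *= boost;
        }
        return result;
    }

    public static void main(String[] args) {
        // With all boosts at 1.0 only the field length matters:
        // a field with four times as many terms gets half the norm.
        System.out.println(norm(1.0, 25));
        System.out.println(norm(1.0, 100));
    }
}
```

So the document length enters norm(t,d) only through lengthNorm: with 
the boosts left at 1.0, a 100-term field gets half the norm of a 25-term 
field.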

Sören
