lucene-java-user mailing list archives

From "Karl Koch" <TheRan...@gmx.net>
Subject Re: Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)
Date Tue, 12 Dec 2006 08:59:28 GMT
Hello Doron (and everyone else reading here) :),

thank you for your effort and your time. I really appreciate it. :)

I understand why normalisation is done in general: mainly to correct the bias towards oversized
documents. In the literature I have read so far, considerable effort usually goes into document
normalisation (simply because documents may differ greatly in size), and usually none into
normalising queries (simply because they are usually only a few words anyway).

For this reason, I do not understand why Lucene (in version 1.2) normalises the query(!) with


norm_q : sqrt(sum_t((tf_q*idf_t)^2)) 

which is also called cosine normalisation. This is a technique that is rather comprehensive
and usually used for documents only(!) in all systems I have seen so far. For the documents,
Lucene employs its norm_d_t, which is explained as:

norm_d_t : square root of number of tokens in d in the same field as t

basically just the square root of the number of unique terms in the document (since I always
search over all fields). I would have expected cosine normalisation here...
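
To make the two quantities concrete, here is a minimal Python sketch of both formulas as written above. The function names, the idf values and the sample tokens are made up for illustration and are not taken from Lucene. Note that the sketch counts every token in the field, repeats included, which is how the wording "number of tokens" reads; counting only unique terms would give a different number.

```python
import math

def query_norm(query_tf, idf):
    """norm_q: sqrt(sum_t((tf_q * idf_t)^2)), the cosine norm of the query."""
    return math.sqrt(sum((tf * idf[t]) ** 2 for t, tf in query_tf.items()))

def field_length_norm(field_tokens):
    """norm_d_t: square root of the number of tokens in the field."""
    return math.sqrt(len(field_tokens))

# Hypothetical values, chosen only to keep the arithmetic easy to follow.
idf = {"lucene": 3.0, "scoring": 4.0}
query_tf = {"lucene": 1, "scoring": 1}
norm_q = query_norm(query_tf, idf)          # sqrt(3^2 + 4^2)

field_tokens = ["lucene", "scoring", "lucene", "formula"]
norm_d_t = field_length_norm(field_tokens)  # sqrt(4), repeats are counted
```

The query norm depends on the term weights, while norm_d_t depends only on the length of the field.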

The paper you provided uses document normalisation in the following way:

norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))

I am not sure how this relates to norm_d_t. Did I misunderstand the Lucene formula, or am I
misinterpreting something? I enclosed the formula as a graphic (from LaTeX) so you can have a
look should this be the case here...
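
One way to see how the two document normalisations differ is to put them side by side in a small Python sketch. The assumptions are mine: both are expressed as multipliers (so norm_d_t is taken as a reciprocal), the paper's formula is fed the number of unique terms while norm_d_t is fed the token count, and all the numbers are invented.

```python
import math

def paper_norm(unique_terms, avg_doc_length):
    """norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))."""
    return 1.0 / math.sqrt(0.8 * avg_doc_length + 0.2 * unique_terms)

def sqrt_length_norm(num_tokens):
    """Reciprocal of norm_d_t, i.e. 1 / sqrt(number of tokens in the field)."""
    return 1.0 / math.sqrt(num_tokens)

avg_doc_length = 100           # hypothetical collection average
short_doc, long_doc = 20, 500  # hypothetical sizes, used for both measures

# How much more weight the short document gets relative to the long one.
paper_ratio = paper_norm(short_doc, avg_doc_length) / paper_norm(long_doc, avg_doc_length)
sqrt_ratio = sqrt_length_norm(short_doc) / sqrt_length_norm(long_doc)
```

With these made-up numbers the paper's formula favours the short document far less strongly than the plain square-root norm does, because the average length term dominates the denominator.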

I will post the other questions separately since they actually also relate to the new Lucene
scoring algorithm (they have not changed). Thank you for your time again :)

Karl


-------- Original Message --------
Date:  Mon, 11 Dec 2006 22:41:56 -0800
From: Doron Cohen <DORONC@il.ibm.com>
To: java-user@lucene.apache.org
Subject:  Re:  Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

> > Well it doesn't since there is no justification of why it is the
> > way it is. It's like saying, here is that car with 5 wheels... enjoy
> > driving.
> 
> > >  - I think the explanations there would also answer at least some of
> > > your questions.
> 
> I hoped it would answer *some* of the questions... (not all)
> 
> Let me try some more :-)
> 
> > 2a) Why does Lucene normalise with Cosine Normalisation for the
> > query? In a range of different IR system variations (as shown in
> > Chisholm and Kolda's paper in the table on page 8) queries were not
> > normalised at all. Is there a good reason or perhaps any empirical
> > evidence that would support this decision?
> 
> I think "why normalizing at all" is answered in
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
> ... "queryNorm(q) is a normalizing factor used to make scores between
> queries comparable. This factor does not affect document ranking (since all
> ranked documents are multiplied by the same factor), but rather just
> attempts to make scores from different queries (or even different indexes)
> comparable. This is a search time factor computed by the Similarity in
> effect at search time. The default computation in DefaultSimilarity is:
> queryNorm(q) = 1/sqrt(sumOfSquaredWeights)"
> 
> I think there were discussions on this normalization. Anyhow I am not the
> expert to justify this.
> 
> > 2b) What is the motivation for the normalisation imposed on the
> > documents (norm_d_t) which I have not seen before in any other
> > system. Again, does anybody have pointers to literature on this?
> 
> If I understand what you are asking, the logic is that verrry long
> documents could virtually contain the entire collection, and therefore
> should be "punished" for being too long, otherwise they would have an
> "unfair advantage" over short documents. Google "search engine document
> length normalization" for more on this, or see also
> http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf
> 
> > Any answer or partial answer on any of the questions would be
> > greatly appreciated!
> 
> I hope this (although partial) helps at all,
> Doron
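
The point quoted above, that queryNorm does not affect ranking because every document score is multiplied by the same factor, can be checked with a small Python sketch. The document names, raw scores and the sum of squared weights are hypothetical, and the helper names are mine rather than Lucene's.

```python
import math

def query_norm_factor(sum_of_squared_weights):
    """queryNorm(q) = 1 / sqrt(sumOfSquaredWeights), as quoted above."""
    return 1.0 / math.sqrt(sum_of_squared_weights)

def ranking(scores):
    """Document ids ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical unnormalised scores for three documents.
raw_scores = {"doc_a": 2.0, "doc_b": 8.0, "doc_c": 5.0}

factor = query_norm_factor(16.0)  # 1 / sqrt(16)
normalised = {doc: score * factor for doc, score in raw_scores.items()}

# The same positive factor scales every score, so the order is unchanged.
same_order = ranking(raw_scores) == ranking(normalised)
```

Only the absolute score values change, which is why the factor matters for comparing scores across queries but not for ordering results within one query.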