lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chuck Williams" <ch...@manawiz.com>
Subject RE: A question about scoring function in Lucene
Date Wed, 15 Dec 2004 17:05:12 GMT
Nhan,

You are correct that dropping the document norm does cause Lucene's scoring model to deviate
from the pure vector space model.  However, including norm_d would cause other problems --
e.g., with short queries, as are typical in reality, the resulting scores with norm_d would
all be extremely small.  You are also correct that since norm_q is invariant, it does not
affect relevance ranking.  Norm_q is simply part of the normalization of final scores.  There
are many different formulas for scoring and relevance ranking in IR.  All of these have some
intuitive justification, but in the end can only be evaluated empirically.  There is no "correct"
formula.

I believe the biggest problem with Lucene's approach relative to the pure vector space model
is that Lucene does not properly normalize.  The pure vector space model implements a cosine
in the strictly positive sector of the coordinate space.  This is guaranteed intrinsically
to be between 0 and 1, and produces scores that can be compared across distinct queries (i.e.,
"0.8" means something about the result quality independent of the query).

Lucene does not have this property.  Its formula produces scores of arbitrary magnitude depending
on the query.  The results cannot be compared meaningfully across queries; i.e., "0.8" means
nothing intrinsically.  To keep final scores between 0 and 1, Lucene introduces an ad hoc
query-dependent final normalization in Hits:  viz., it divides all scores by the highest score
if the highest score happens to be greater than 1.  This makes it impossible for an application
to properly inform its users about the quality of the results, to cut off bad results, etc.
 Applications may do that, but in fact what they are doing is random, not what they think
they are doing.

I've proposed a fix for this -- there was a long thread on Lucene-dev.  It is possible to
revise Lucene's scoring to keep its efficiency, keep its current per-query relevance ranking,
and yet intrinsically normalize its scores so that they are meaningful across queries.  I
posted a fairly detailed spec of how to do this in the Lucene-dev thread.  I'm hoping to have
time to build it and submit it as a proposed update to Lucene, but it is a large effort that
would involve changing just about every scoring class in Lucene.  I'm not sure it would be
incorporated even if I did it as that would take considerable work from a developer.  There
doesn't seem to be much concern about these various scoring and relevancy ranking issues among
the general Lucene community.

Chuck

  > -----Original Message-----
  > From: Nhan Nguyen Dang [mailto:ndnhan2003@yahoo.com]
  > Sent: Wednesday, December 15, 2004 1:18 AM
  > To: Lucene Users List
  > Subject: RE: A question about scoring function in Lucene
  > 
  > Thank for your answer,
  > In Lucene scoring function, they use only norm_q,
  > but for one query, norm_q is the same for all
  > documents.
  > So norm_q is actually not effect the score.
  > But norm_d is different, each document has a different
  > norm_d; it effect the score of document d for query q.
  > If you drop it, the score information is not correct
  > anymore or it not space vector model anymore.  Could
  > you explain it a little bit.
  > 
  > I think that it's expensive to computed in incremetal
  > indexing because when one document is added, idf of
  > each term changed. But drop it is not a good choice.
  > 
  > What is the role of norm_d_t ?
  > Nhan.
  > 
  > --- Chuck Williams <chuck@manawiz.com> wrote:
  > 
  > > Nhan,
  > >
  > > Re.  your two differences:
  > >
  > > 1 is not a difference.  Norm_d and Norm_q are both
  > > independent of t, so summing over t has no effect on
  > > them.  I.e., Norm_d * Norm_q is constant wrt the
  > > summation, so it doesn't matter if the sum is over
  > > just the numerator or over the entire fraction, the
  > > result is the same.
  > >
  > > 2 is a difference.  Lucene uses Norm_q instead of
  > > Norm_d because Norm_d is too expensive to compute,
  > > especially in the presence of incremental indexing.
  > > E.g., adding or deleting any document changes the
  > > idf's, so if Norm_d was used it would have to be
  > > recomputed for ALL documents.  This is not feasible.
  > >
  > > Another point you did not mention is that the idf
  > > term is squared (in both of your formulas).  Salton,
  > > the originator of the vector space model, dropped
  > > one idf factor from his formula as it improved
  > > results empirically.  More recent theoretical
  > > justifications of tf*idf provide intuitive
  > > explanations of why idf should only be included
  > > linearly.  tf is best thought of as the real vector
  > > entry, while idf is a weighting term on the
  > > components of the inner product.  E.g., seen the
  > > excellent paper by Robertson, "Understanding inverse
  > > document frequency: on theoretical arguments for
  > > IDF", available here:
  > > http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl
  > > if you sign up for an eval.
  > >
  > > It's easy to correct for idf^2 by using a customer
  > > Similarity that takes a final square root.
  > >
  > > Chuck
  > >
  > >   > -----Original Message-----
  > >   > From: Vikas Gupta [mailto:vgupta@cs.utexas.edu]
  > >   > Sent: Tuesday, December 14, 2004 9:32 PM
  > >   > To: Lucene Users List
  > >   > Subject: Re: A question about scoring function
  > > in Lucene
  > >   >
  > >   > Lucene uses the vector space model. To
  > > understand that:
  > >   >
  > >   > -Read section 2.1 of "Space optimizations for
  > > Total Ranking" paper
  > >   > (Linked
  > >   > here
  > > http://lucene.sourceforge.net/publications.html)
  > >   > -Read section 6 to 6.4 of
  > >   >
  > >
  > http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
  > >   > -Read section 1 of
  > >   >
  > >
  > http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps
  > >   >
  > >   > Vikas
  > >   >
  > >   > On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:
  > >   >
  > >   > > Hi all,
  > >   > > Lucene score document based on the correlation
  > > between
  > >   > > the query q and document t:
  > >   > > (this is raw function, I don't pay attention
  > > to the
  > >   > > boost_t, coord_q_d factor)
  > >   > >
  > >   > > score_d = sum_t( tf_q * idf_t / norm_q * tf_d
  > > * idf_t
  > >   > > / norm_d_t)  (*)
  > >   > >
  > >   > > Could anybody explain it in detail ? Or are
  > > there any
  > >   > > papers, documents about this function ?
  > > Because:
  > >   > >
  > >   > > I have also read the book: Modern Information
  > >   > > Retrieval, author: Ricardo Baeza-Yates and
  > > Berthier
  > >   > > Ribeiro-Neto, Addison Wesley (Hope you have
  > > read it
  > >   > > too). In page 27, they also suggest a scoring
  > > funtion
  > >   > > for vector model based on the correlation
  > > between
  > >   > > query q and document d as follow (I use
  > > different
  > >   > > symbol):
  > >   > >
  > >   > > 	         sum_t( weight_t_d * weight_t_q)
  > >   > > score_d(d, q)=
  > > --------------------------------- (**)
  > >   > > 	     	      norm_d * norm_q
  > >   > >
  > >   > > where weight_t_d = tf_d * idf_t
  > >   > >       weight_t_q = tf_q * idf_t
  > >   > >       norm_d = sqrt( sum_t( (tf_d * idf_t)^2 )
  > > )
  > >   > >       norm_q = sqrt( sum_t( (tf_q * idf_t)^2 )
  > > )
  > >   > >
  > >   > > (**):          sum_t( tf_q*idf_t * tf_d*idf_t)
  > >   > > score_d(d,
  > > q)=---------------------------------  (***)
  > >   > > 		   norm_d * norm_q
  > >   > >
  > >   > > The two function, (*) and (***), have 2
  > > differences:
  > >   > > 1. in (***), the sum_t is just for the
  > > numerator but
  > >   > > in the (*), the sum_t is for everything. So,
  > > with
  > >   > > norm_q = sqrt(sum_t((tf_q*idf_t)^2)); sum_t is
  > >   > > calculated twice. Is this right? please
  > > explain.
  > >   > >
  > >   > > 2. No factor that define norms of the
  > > document: norm_d
  > >   > > in the function (*). Can you explain this.
  > > what is the
  > >   > > role of factor norm_d_t ?
  > >   > >
  > >   > > One more question: could anybody give me
  > > documents,
  > >   > > papers that explain this function in detail.
  > > so when I
  > >   > > apply Lucene for my system, I can adapt the
  > > document,
  > >   > > and the field so that I still receive the
  > > correct
  > >   > > scoring information from Lucene .
  > >   > >
  > >   > > Best regard,
  > >   > > Thanks every body,
  > >   > >
  > >   > > =====
  > >   > > Ð#7863;ng Nhân
  > >   >
  > >   >
  > >
  > ---------------------------------------------------------------------
  > >   > To unsubscribe, e-mail:
  > > lucene-user-unsubscribe@jakarta.apache.org
  > >   > For additional commands, e-mail:
  > > lucene-user-help@jakarta.apache.org
  > >
  > >
  > >
  > ---------------------------------------------------------------------
  > > To unsubscribe, e-mail:
  > > lucene-user-unsubscribe@jakarta.apache.org
  > > For additional commands, e-mail:
  > > lucene-user-help@jakarta.apache.org
  > >
  > >
  > 
  > 
  > =====
  > Ð#7863;ng Nhân
  > 
  > 
  > 
  > 
  > 
  > 
  > __________________________________
  > Do you Yahoo!?
  > Yahoo! Mail - Helps protect you from nasty viruses.
  > http://promotions.yahoo.com/new_mail
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
  > For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message