lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chuck Williams" <ch...@manawiz.com>
Subject RE: A question about scoring function in Lucene
Date Wed, 15 Dec 2004 06:00:11 GMT
Nhan,

Re.  your two differences:

1 is not a difference.  Norm_d and Norm_q are both independent of t, so summing over t has
no effect on them.  I.e., Norm_d * Norm_q is constant wrt the summation, so it doesn't matter
if the sum is over just the numerator or over the entire fraction, the result is the same.

2 is a difference.  Lucene uses Norm_q instead of Norm_d because Norm_d is too expensive to
compute, especially in the presence of incremental indexing.  E.g., adding or deleting any
document changes the idf's, so if Norm_d was used it would have to be recomputed for ALL documents.
 This is not feasible.

Another point you did not mention is that the idf term is squared (in both of your formulas).
 Salton, the originator of the vector space model, dropped one idf factor from his formula
as it improved results empirically.  More recent theoretical justifications of tf*idf provide
intuitive explanations of why idf should only be included linearly.  tf is best thought of
as the real vector entry, while idf is a weighting term on the components of the inner product.
 E.g., seen the excellent paper by Robertson, "Understanding inverse document frequency: on
theoretical arguments for IDF", available here:  http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl
if you sign up for an eval.

It's easy to correct for idf^2 by using a customer Similarity that takes a final square root.

Chuck

  > -----Original Message-----
  > From: Vikas Gupta [mailto:vgupta@cs.utexas.edu]
  > Sent: Tuesday, December 14, 2004 9:32 PM
  > To: Lucene Users List
  > Subject: Re: A question about scoring function in Lucene
  > 
  > Lucene uses the vector space model. To understand that:
  > 
  > -Read section 2.1 of "Space optimizations for Total Ranking" paper
  > (Linked
  > here http://lucene.sourceforge.net/publications.html)
  > -Read section 6 to 6.4 of
  > http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
  > -Read section 1 of
  > http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps
  > 
  > Vikas
  > 
  > On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:
  > 
  > > Hi all,
  > > Lucene score document based on the correlation between
  > > the query q and document t:
  > > (this is raw function, I don't pay attention to the
  > > boost_t, coord_q_d factor)
  > >
  > > score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t
  > > / norm_d_t)  (*)
  > >
  > > Could anybody explain it in detail ? Or are there any
  > > papers, documents about this function ? Because:
  > >
  > > I have also read the book: Modern Information
  > > Retrieval, author: Ricardo Baeza-Yates and Berthier
  > > Ribeiro-Neto, Addison Wesley (Hope you have read it
  > > too). In page 27, they also suggest a scoring funtion
  > > for vector model based on the correlation between
  > > query q and document d as follow (I use different
  > > symbol):
  > >
  > > 	         sum_t( weight_t_d * weight_t_q)
  > > score_d(d, q)=  --------------------------------- (**)
  > > 	     	      norm_d * norm_q
  > >
  > > where weight_t_d = tf_d * idf_t
  > >       weight_t_q = tf_q * idf_t
  > >       norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
  > >       norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )
  > >
  > > (**):          sum_t( tf_q*idf_t * tf_d*idf_t)
  > > score_d(d, q)=---------------------------------  (***)
  > > 		   norm_d * norm_q
  > >
  > > The two function, (*) and (***), have 2 differences:
  > > 1. in (***), the sum_t is just for the numerator but
  > > in the (*), the sum_t is for everything. So, with
  > > norm_q = sqrt(sum_t((tf_q*idf_t)^2)); sum_t is
  > > calculated twice. Is this right? please explain.
  > >
  > > 2. No factor that define norms of the document: norm_d
  > > in the function (*). Can you explain this. what is the
  > > role of factor norm_d_t ?
  > >
  > > One more question: could anybody give me documents,
  > > papers that explain this function in detail. so when I
  > > apply Lucene for my system, I can adapt the document,
  > > and the field so that I still receive the correct
  > > scoring information from Lucene .
  > >
  > > Best regard,
  > > Thanks every body,
  > >
  > > =====
  > > Ð#7863;ng Nhân
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
  > For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message