Date: Tue, 14 Dec 2004 23:32:03 -0600 (CST)
From: Vikas Gupta
To: Lucene Users List
Subject: Re: A question about scoring function in Lucene

Lucene uses
the vector space model. To understand that:

- Read section 2.1 of the "Space optimizations for Total Ranking" paper
  (linked here: http://lucene.sourceforge.net/publications.html)
- Read sections 6 to 6.4 of
  http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
- Read section 1 of
  http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps

Vikas

On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:

> Hi all,
> Lucene scores a document based on the correlation between
> the query q and document d:
> (this is the raw function; I am ignoring the
> boost_t and coord_q_d factors)
>
> score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )   (*)
>
> Could anybody explain it in detail? Or are there any
> papers or documents about this function? Because:
>
> I have also read the book Modern Information
> Retrieval, by Ricardo Baeza-Yates and Berthier
> Ribeiro-Neto, Addison Wesley (hope you have read it
> too). On page 27, they also suggest a scoring function
> for the vector model based on the correlation between
> query q and document d, as follows (I use different
> symbols):
>
>                 sum_t( weight_t_d * weight_t_q )
> score_d(d, q) = --------------------------------   (**)
>                         norm_d * norm_q
>
> where weight_t_d = tf_d * idf_t
>       weight_t_q = tf_q * idf_t
>       norm_d     = sqrt( sum_t( (tf_d * idf_t)^2 ) )
>       norm_q     = sqrt( sum_t( (tf_q * idf_t)^2 ) )
>
> (**):
>                 sum_t( tf_q*idf_t * tf_d*idf_t )
> score_d(d, q) = --------------------------------   (***)
>                         norm_d * norm_q
>
> The two functions, (*) and (***), have 2 differences:
> 1. In (***), the sum_t covers just the numerator, but
> in (*), the sum_t covers everything. So, with
> norm_q = sqrt(sum_t((tf_q*idf_t)^2)), sum_t is
> calculated twice. Is this right? Please explain.
>
> 2. There is no factor that defines the norm of the document, norm_d,
> in function (*). Can you explain this? What is the
> role of the factor norm_d_t?
>
> One more question: could anybody point me to documents or
> papers that explain this function in detail, so that when I
> apply Lucene to my system, I can adapt the documents
> and the fields and still receive the correct
> scoring information from Lucene?
>
> Best regards,
> Thanks everybody,
>
> =====
> Đặng Nhân
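
On difference 1 in the quoted question, a small numeric check may help. The sketch below is not Lucene code; it only evaluates the two formulas as written in the message, on made-up term statistics. It assumes, purely for illustration, that norm_d_t in (*) is one constant per document (the same value as norm_d). Under that assumption the norms do not depend on the term t being summed over, so they can be pulled outside the sum and (*) gives the same number as (***). The names TF_Q, TF_D, IDF and the three functions are hypothetical.

```python
import math

# Made-up term statistics for one query and one document (illustrative only).
TF_Q = {"lucene": 1, "score": 1}
TF_D = {"lucene": 2, "score": 1, "index": 3}
IDF = {"lucene": 1.5, "score": 2.0, "index": 1.0}


def vector_norm(tf, idf):
    """sqrt( sum_t( (tf_t * idf_t)^2 ) ), computed once per vector."""
    return math.sqrt(sum((tf[t] * idf[t]) ** 2 for t in tf))


def score_per_term(tf_q, tf_d, idf, norm_q, norm_d):
    """Form (*): each term's product is divided by the norms inside the sum.

    Assumption: norm_d_t is the same constant norm_d for every term t.
    """
    return sum(
        (tf_q[t] * idf[t] / norm_q) * (tf_d.get(t, 0) * idf[t] / norm_d)
        for t in tf_q
    )


def score_cosine(tf_q, tf_d, idf, norm_q, norm_d):
    """Form (***): sum the weight products first, then divide by both norms."""
    dot = sum(tf_q[t] * idf[t] * tf_d.get(t, 0) * idf[t] for t in tf_q)
    return dot / (norm_d * norm_q)


# The norms are computed once, outside the per-term sum.
NORM_Q = vector_norm(TF_Q, IDF)
NORM_D = vector_norm(TF_D, IDF)

if __name__ == "__main__":
    print(score_per_term(TF_Q, TF_D, IDF, NORM_Q, NORM_D))
    print(score_cosine(TF_Q, TF_D, IDF, NORM_Q, NORM_D))
```

Both calls print the same value up to floating-point rounding. This says nothing about what Lucene actually stores in norm_d_t; if that factor differs from term to term (for example per field), it cannot be factored out of the sum this way.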