Return-Path: Delivered-To: apmail-jakarta-lucene-user-archive@www.apache.org Received: (qmail 95468 invoked from network); 17 Dec 2004 20:03:07 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (209.237.227.199) by minotaur-2.apache.org with SMTP; 17 Dec 2004 20:03:07 -0000 Received: (qmail 96781 invoked by uid 500); 17 Dec 2004 19:35:50 -0000 Delivered-To: apmail-jakarta-lucene-user-archive@jakarta.apache.org Received: (qmail 96709 invoked by uid 500); 17 Dec 2004 19:35:48 -0000 Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm Precedence: bulk List-Unsubscribe: List-Subscribe: List-Help: List-Post: List-Id: "Lucene Users List" Reply-To: "Lucene Users List" Delivered-To: mailing list lucene-user@jakarta.apache.org Received: (qmail 96663 invoked by uid 99); 17 Dec 2004 19:35:48 -0000 X-ASF-Spam-Status: No, hits=1.0 required=10.0 tests=SPF_HELO_SOFTFAIL X-Spam-Check-By: apache.org Received-SPF: pass (hermes.apache.org: local policy) Received: from reh001-1.rex001.exchangebyregister.com (HELO reh001-1.REX001.ExchangeByRegister.com) (64.78.19.14) by apache.org (qpsmtpd/0.28) with ESMTP; Fri, 17 Dec 2004 11:34:47 -0800 X-MimeOLE: Produced By Microsoft Exchange V6.5.7226.0 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: quoted-printable Subject: RE: Relevance and ranking ... Date: Fri, 17 Dec 2004 11:32:29 -0800 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Relevance and ranking ... Thread-Index: AcTkMRB83FeoegsNRpOgVjXSnct3LAAPP4tQ From: "Chuck Williams" To: "Lucene Users List" X-Virus-Checked: Checked X-Spam-Rating: minotaur-2.apache.org 1.6.2 0/1000/N Another issue will likely be the tf() and idf() computations. I have a similar desired relevance ranking and was not getting what I wanted due to the idf() term dominating the score. Lucene squares the contribution of this term, which is not considered best practice in IR. To address these issues, I increased the base of the log for both tf() and idf() (tones them down) and took a final square root on idf(). FYI, here are the definitions I'm using for these methods -- similar definitions should give you the ordering you want. You might want to adjust lengthNorm if you really want it to be linear (square root by default). You should not have to touch coord(). public float tf(float freq) { return 1.0f + (float)Math.log10(freq); } =20 public float idf(int docFreq, int numDocs) { return (float)Math.sqrt(1.0 + Math.log10(numDocs/(double)(docFreq+1))); } =20 Chuck > -----Original Message----- > From: Erik Hatcher [mailto:erik@ehatchersolutions.com] > Sent: Friday, December 17, 2004 4:06 AM > To: Lucene Users List > Subject: Re: Relevance and ranking ... >=20 >=20 > On Dec 17, 2004, at 6:09 AM, Gururaja H wrote: > > Thanks for the reply. Is there any sample code which tells me how to > > change these > > coord() factor, overlapping, lenght normalizaiton etc.. ?? > > If there are any please provide me. >=20 > Have a look at Lucene's DefaultSimilarity code itself. Use that as a > starting point - in fact you should subclass it and only override the > one or two methods you want to tweak. >=20 > There are probably some other examples in Lucene's test cases, or that > have been posted to the list but I don't have handy pointers to them. >=20 > Erik >=20 >=20 > > > > Thanks, > > Gururaja > > > > > > Erik Hatcher wrote: > > The coord() factor of Similarity is what controls a muliplier factor > > for overlapping query terms in a document. The DefaultSimilarity > > already contains factors that allow documents with overlapping terms > to > > get boosted. Is this not working for you? You may also need to adjust > > length normalization factors. Check the javadocs on Similarity for > > details on implementing your own formulas. Also become familiar with > > IndexSearcher.explain() and the Explanation so that you can see how > > adjusting things affects the details. > > > > Erik > > > > On Dec 17, 2004, at 3:42 AM, Gururaja H wrote: > > > >> Hi, > >> > >> How to implement the following ? Please provide inputs .... > >> > >> > >> For example, if the search query has 5 terms (ibm, risc, tape, drive, > >> manual) and there are 4 matching documents with the following > >> attributes, then the order should be as described below. > >> > >> Doc#1: contains terms (ibm, drive) and has a total of 100 terms in > the > >> document. > >> > >> Doc#2: contains terms (ibm, risc, tape, drive) and has a total of 30 > >> terms in the document. > >> > >> Doc#3: contains terms (ibm, risc, tape, drive) and has a total of 100 > >> terms in the document. > >> > >> Doc#4: contains terms (ibm, risc, tape, drive, manual) and has a > total > >> of 300 terms in the document > >> > >> The search results should include all three documents since each has > >> one or more of the search terms, however, the order should be > returned > >> as: > >> > >> Doc#4 > >> > >> Doc#2 > >> > >> Doc#3 > >> > >> Doc#1 > >> > >> Doc#4 should be first, since of the 5 search terms, it contains all 5. > >> > >> Doc#2 should be second, since it has 4 of the 5 search terms and of > >> the number of terms in the document, its ratio is higher than Doc#3 > >> (4/30). Doc#3 has 4 of the 5 terms, but its ratio is 4/100. > >> > >> Doc#1 is last since it only has 2 of the 5 terms. > >> > >> > >> ---- > >> > >> Thanks, > >> Gururaja > >> > >> > >> __________________________________________________ > >> Do You Yahoo!? > >> Tired of spam? Yahoo! Mail has the best spam protection around > >> http://mail.yahoo.com > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org > > > > > > > > --------------------------------- > > Do you Yahoo!? > > Send holiday email and support a worthy cause. Do good. >=20 >=20 > --------------------------------------------------------------------- > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org > For additional commands, e-mail: lucene-user-help@jakarta.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org For additional commands, e-mail: lucene-user-help@jakarta.apache.org