Subject: Re: Small field indexing and ranking
From: "Nadav Har'El"
To: java-user@lucene.apache.org
Date: Tue, 11 Apr 2006 11:33:23 +0300

"Maxym Mykhalchuk" wrote on 10/04/2006 09:46:16 PM:

> Here's the issue: All my "documents" will be having a few (2-3:
> title, short description) short fields. You see, it's rare that the
> same word is repeated several times in a title, so will Lucene be
> able to give me a decent ranking, or will it be able to tell me "oh,
> yes, this term is in the following 300 titles".
>
> On what I've read on the topic so far, it seems that inverted
> indexes do work good on big texts, as they are able to exploit the
> repetition of words to do ranking.

Lucene is no psychic. If you're looking for "dog", and the index contains two short documents, actually titles:

    "Sparky the Fire Dog"
    "Dog Hause Home Page"

(just two silly titles from Google's top 10 results for "dog"...), then there's hardly any way for Lucene to determine which document should be ranked higher.

For single-word queries in a situation like this, you might want to help Lucene learn the "good" ranking. One way is to use Document.setBoost() (or Field.setBoost()) to pre-determine which document is more "important" regardless of its text (e.g., using some sort of link analysis, or whatever trick is applicable in your situation).
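To make the tie concrete, here is a minimal plain-Java sketch. It does not call Lucene. The class `TitleBoostSketch`, the method `toyScore`, and the formula itself (square root of the term frequency, times 1/sqrt(number of tokens), times a per-document boost) are assumptions made for illustration, only loosely modeled on the kind of factors a relevance score uses; idf and every other factor are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy illustration only: this is NOT Lucene's scoring code.
class TitleBoostSketch {

    // Lower-case the text and split on non-letters (a crude stand-in for an analyzer).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Hypothetical score: sqrt(term frequency) * 1/sqrt(title length) * boost.
    static double toyScore(String title, String term, double boost) {
        List<String> tokens = tokenize(title);
        long freq = tokens.stream().filter(term::equals).count();
        if (freq == 0) {
            return 0.0; // the term does not occur in this title
        }
        double tf = Math.sqrt(freq);
        double lengthNorm = 1.0 / Math.sqrt(tokens.size());
        return tf * lengthNorm * boost;
    }

    public static void main(String[] args) {
        // Both titles have four tokens and one "dog", so the text alone cannot separate them.
        System.out.println(toyScore("Sparky the Fire Dog", "dog", 1.0));
        System.out.println(toyScore("Dog Hause Home Page", "dog", 1.0));
        // An externally chosen boost (here an arbitrary 2.0) breaks the tie.
        System.out.println(toyScore("Dog Hause Home Page", "dog", 2.0));
    }
}
```

With equal boosts the two titles score identically, which is the situation described above; only information from outside the text, expressed here as the boost argument, separates them.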
Another way is to override Lucene's relevance ranking with some other type of sorting (see the Sort class), for example sorting all the matching results by date to get the newer matching results first. In many applications, you might want to let your users control this sort order. For example, in a shopping site (where product names are the very short "documents"), you might want to let the user sort the results by price, by popularity, by release date, by users' ranking, and so on.

For multi-word queries, it is actually possible to improve on Lucene's standard ranking. For example, let's say you have the two titles

    "Hot Dog on a Stick"
    "Your Dog in Hot Weather"

and get the query "hot dog" (without quotation marks). Using QueryParser, Lucene will normally rank the two titles more or less the same. However, the first one is probably much better, because the words "hot" and "dog" don't just appear there: they appear very close together, and in this case even in order.

This sort of proximity-influenced scoring is missing from Lucene's QueryParser, and I've been wondering recently how best to add it, and whether it can easily be done with existing Lucene machinery, like the SpanQuery class. Has anyone tried something like this before, and can you tell us about your experience?

Good Luck,
Nadav.

--
Nadav Har'El
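The proximity idea raised above for the query "hot dog" can be sketched in plain Java, again without calling Lucene. The class `ProximitySketch`, the methods `minWindow` and `proximityFactor`, and the 1/(1 + slack) formula are all hypothetical choices for illustration. The sketch measures the smallest run of consecutive tokens that contains every query term and ignores term order; how such a factor would be combined with the regular score is left open.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: a hypothetical proximity factor, not a Lucene API.
class ProximitySketch {

    // Lower-case the text and split on non-letters (a crude stand-in for an analyzer).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Width (in tokens) of the smallest window containing every wanted term, or -1 if none exists.
    static int minWindow(List<String> tokens, Set<String> wanted) {
        int best = -1;
        for (int start = 0; start < tokens.size(); start++) {
            if (!wanted.contains(tokens.get(start))) {
                continue; // a minimal window always starts on a query term
            }
            Set<String> seen = new HashSet<>();
            for (int end = start; end < tokens.size(); end++) {
                if (wanted.contains(tokens.get(end))) {
                    seen.add(tokens.get(end));
                }
                if (seen.size() == wanted.size()) {
                    int width = end - start + 1;
                    if (best < 0 || width < best) {
                        best = width;
                    }
                    break;
                }
            }
        }
        return best;
    }

    // 1.0 when the terms are adjacent, shrinking as extra tokens separate them; 0.0 if a term is missing.
    static double proximityFactor(String title, String query) {
        Set<String> wanted = new HashSet<>(tokenize(query));
        int width = minWindow(tokenize(title), wanted);
        if (width < 0) {
            return 0.0;
        }
        int slack = width - wanted.size(); // tokens inside the window that are not needed
        return 1.0 / (1 + slack);
    }

    public static void main(String[] args) {
        System.out.println(proximityFactor("Hot Dog on a Stick", "hot dog"));      // adjacent terms
        System.out.println(proximityFactor("Your Dog in Hot Weather", "hot dog")); // one token in between
    }
}
```

Under these assumptions "Hot Dog on a Stick" gets a factor of 1.0 and "Your Dog in Hot Weather" gets 0.5, which matches the intuition that the first title is the better match. Within Lucene, the span query family mentioned above (for example a SpanNearQuery with a slop and an in-order flag) is the existing mechanism for matching terms that occur near each other.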