Return-Path: Delivered-To: apmail-lucene-java-user-archive@www.apache.org Received: (qmail 65343 invoked from network); 11 Apr 2006 08:53:55 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (209.237.227.199) by minotaur.apache.org with SMTP; 11 Apr 2006 08:53:55 -0000 Received: (qmail 71786 invoked by uid 500); 11 Apr 2006 08:53:47 -0000 Delivered-To: apmail-lucene-java-user-archive@lucene.apache.org Received: (qmail 71766 invoked by uid 500); 11 Apr 2006 08:53:47 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 71755 invoked by uid 99); 11 Apr 2006 08:53:47 -0000 Received: from asf.osuosl.org (HELO asf.osuosl.org) (140.211.166.49) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 11 Apr 2006 01:53:47 -0700 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests= X-Spam-Check-By: apache.org Received-SPF: pass (asf.osuosl.org: local policy) Received: from [193.205.194.4] (HELO dit.unitn.it) (193.205.194.4) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 11 Apr 2006 01:53:46 -0700 Received: from ict2011 (ict20-11.science.unitn.it [192.168.233.111]) by dit.unitn.it (8.12.11.20060308/8.12.11) with SMTP id k3B8rMm3020381 for ; Tue, 11 Apr 2006 10:53:22 +0200 Message-ID: <008501c65d45$36b27730$6fe9a8c0@ict2011> From: "Maxym Mykhalchuk" To: References: Subject: Re: Small field indexing and ranking Date: Tue, 11 Apr 2006 10:52:07 +0200 Organization: DIT MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=original Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.2670 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2670 X-Virus-Checked: Checked by ClamAV on apache.org X-Spam-Rating: minotaur.apache.org 1.6.2 0/1000/N Hi Nadav, Thanks for suggestions. As for improving multi-word queries, Doug Cutting recently posted a link to his presentation, http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf, just scroll down to Nutch N-Grams there, and you'll see the answer. Basically, "Buffy the Vampire" is indexed as buffy, buffy-the+0, the, the-vampire+0, vampire; but that's only for common terms like "the", "www" etc. I wander if it's possible to use the same technique to store phrases... Maxym ================================== Maxym Mykhalchuk (+39) 320 8593170 PhD student at University of Trento, ITALY ================================== ----- Original Message ----- From: "Nadav Har'El" To: Sent: Tuesday, April 11, 2006 10:33 AM Subject: Re: Small field indexing and ranking > "Maxym Mykhalchuk" wrote on 10/04/2006 09:46:16 PM: >> Here's the issue: All my "documents" will be having a few (2-3: >> title, short description) short fields. You see, it's rare that the >> same word is repeated several times in a title, so will Lucene be >> able to give me a decent ranking, or will it be able to tell me "oh, >> yes, this term is in the following 300 titles". >> >> On what I've read on the topic so far, it seems that inverted >> indexes do work good on big texts, as they are able to exploit the >> repetition of words to do ranking. > > Lucene is no psychic. If you're looking for "dog", and the document > contains two short documents, actually titles: > "Sparky the Fire Dog" > and "Dog Hause Home Page" > (just two silly titles from Google's top 10 results for "dog"...) > Then there's hardly any way for Lucene to determine which document > should be ranked higher. > > For single word queries in a situation like this, you might want > to help Lucene learn the "good" ranking. One way is to use > Document.setBoost() (or Field.setBoost) to pre-determine which > document is more "important" regardless of its text (e.g., > using some sort of link analysis, or whatever trick that is > applicable in your situation). Another way is to override > Lucene's relevance ranking with some other type of sorting > (see the Sort class) - for example, to sort all the matching > results by date, to get the newer matching results first. > In many applications, you might want to let your users control > this sort order; For example, in a shopping site (where product > names are the very short "documents"), you might want to let > the user sort the results by price, by popularity, by release > date, by users' ranking, and so on. > > For multi-word queries, it is actually possible to improve > on Lucene's standard ranking. For example, let's say you > have the two titles > "Hot Dog on a Stick" > "Your Dog in Hot Weather" > And get a query "hot dog" (without quotation marks). > Using QueryParser, Lucene will normally rank the two titles > more or less the same. However, the first one is probably > much better because the words "hot" and "dog", don't just > appear there, they actually appear very close, and in this > case even in order. > > This sort of proximity-influenced scoring is missing from > Lucene's QueryParser, and I've been wondering recently > on how it is best to add it, and whether it is possible to > easily do it with existing Lucene machinary, like the > SpanQuery class. Has anyone ever tried to do something > like this before, and can tell us their experience? > > Good Luck, > Nadav. > > -- > Nadav Har'El > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org > For additional commands, e-mail: java-user-help@lucene.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org