Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (asf.osuosl.org: local policy)
Message-ID: <008501c65d45$36b27730$6fe9a8c0@ict2011>
From: "Maxym Mykhalchuk" <maxym@dit.unitn.it>
To: <java-user@lucene.apache.org>
References: 
 <OF2E77F57F.F815FAE9-ONC225714D.002CF00C-C225714D.002F009F@il.ibm.com>
Subject: Re: Small field indexing and ranking
Date: Tue, 11 Apr 2006 10:52:07 +0200
Organization: DIT
MIME-Version: 1.0
Content-Type: text/plain;
	format=flowed;
	charset="iso-8859-1";
	reply-type=original
Content-Transfer-Encoding: 7bit

Hi Nadav,

Thanks for the suggestions.

As for improving multi-word queries, Doug Cutting recently posted a link to 
his presentation, 
http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf; 
scroll down to the Nutch N-Grams part there for one possible answer.
Basically, "Buffy the Vampire" is indexed as buffy, buffy-the+0, the, 
the-vampire+0, vampire; but the extra n-gram terms are only generated 
around common terms like "the", "www", etc. I wonder if it's possible to 
use the same technique to store phrases...
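
To make the example concrete, here is a rough sketch of how I understand 
the token stream to be built. It is plain Java with no Lucene or Nutch 
classes, so it is not Nutch's actual code; the class name, the method name 
and the list of common terms are all made up for illustration. I read the 
"+0" as a position increment of zero (the n-gram sits at the same position 
as the word before it), and the sketch simply appends it to the term text to 
mirror the notation above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: a hypothetical helper, not part of Lucene or Nutch.
class CommonGramsSketch {

    // Assumed list of "common" terms; a real list would be much longer.
    static final Set<String> COMMON = new HashSet<>(Arrays.asList("the", "www"));

    // Emits every word, plus a two-word gram whenever either word of an
    // adjacent pair is a common term. The "+0" suffix stands in for a
    // position increment of zero.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        for (int i = 0; i < words.length; i++) {
            out.add(words[i]);
            boolean hasNext = i + 1 < words.length;
            if (hasNext && (COMMON.contains(words[i]) || COMMON.contains(words[i + 1]))) {
                out.add(words[i] + "-" + words[i + 1] + "+0");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [buffy, buffy-the+0, the, the-vampire+0, vampire]
        System.out.println(tokens("Buffy the Vampire"));
    }
}
```

A title with no common terms, such as "Hot Dog", comes out as just its 
words, which is why I am not sure the technique carries over to arbitrary 
phrases without making the index much bigger.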

Maxym

==================================
Maxym Mykhalchuk
(+39) 320 8593170
PhD student at University of Trento, ITALY
==================================
----- Original Message ----- 
From: "Nadav Har'El" <NYH@il.ibm.com>
To: <java-user@lucene.apache.org>
Sent: Tuesday, April 11, 2006 10:33 AM
Subject: Re: Small field indexing and ranking


> "Maxym Mykhalchuk" <maxym@dit.unitn.it> wrote on 10/04/2006 09:46:16 PM:
>> Here's the issue: all my "documents" will have only a few short
>> fields (2-3: title, short description). You see, it's rare that the
>> same word is repeated several times in a title, so will Lucene be
>> able to give me a decent ranking, or will it only be able to tell
>> me "oh, yes, this term is in the following 300 titles"?
>>
>> From what I've read on the topic so far, it seems that inverted
>> indexes work well on big texts, as they are able to exploit the
>> repetition of words to do ranking.
>
> Lucene is no psychic. If you're looking for "dog", and the index
> contains two short documents, actually titles:
>      "Sparky the Fire Dog"
> and   "Dog Hause Home Page"
> (just two silly titles from Google's top 10 results for "dog"...)
> Then there's hardly any way for Lucene to determine which document
> should be ranked higher.
>
> For single word queries in a situation like this, you might want
> to help Lucene learn the "good" ranking. One way is to use
> Document.setBoost() (or Field.setBoost) to pre-determine which
> document is more "important" regardless of its text (e.g.,
> using some sort of link analysis, or whatever trick that is
> applicable in your situation). Another way is to override
> Lucene's relevance ranking with some other type of sorting
> (see the Sort class) - for example, to sort all the matching
> results by date, to get the newer matching results first.
> In many applications, you might want to let your users control
> this sort order; for example, in a shopping site (where product
> names are the very short "documents"), you might want to let
> the user sort the results by price, by popularity, by release
> date, by users' ranking, and so on.
>
> For multi-word queries, it is actually possible to improve
> on Lucene's standard ranking. For example, let's say you
> have the two titles
>      "Hot Dog on a Stick"
>      "Your Dog in Hot Weather"
> And get a query "hot dog" (without quotation marks).
> Using QueryParser, Lucene will normally rank the two titles
> more or less the same. However, the first one is probably
> much better because the words "hot" and "dog" don't just
> appear there, they actually appear very close together, and
> in this case even in order.
>
> This sort of proximity-influenced scoring is missing from
> Lucene's QueryParser, and I've been wondering recently
> how best to add it, and whether it is possible to do it
> easily with existing Lucene machinery, like the
> SpanQuery class. Has anyone ever tried to do something
> like this before, and can tell us about their experience?
>
> Good Luck,
> Nadav.
>
> --
> Nadav Har'El
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org 

