Subject: Re: Small field indexing and ranking
To: java-user@lucene.apache.org
From: "Nadav Har'El" <NYH@il.ibm.com>
Date: Tue, 11 Apr 2006 11:33:23 +0300

"Maxym Mykhalchuk" <maxym@dit.unitn.it> wrote on 10/04/2006 09:46:16 PM:
> Here's the issue: All my "documents" will be having a few (2-3:
> title, short description) short fields. You see, it's rare that the
> same word is repeated several times in a title, so will Lucene be
> able to give me a decent ranking, or will it be able to tell me "oh,
> yes, this term is in the following 300 titles".
>
> On what I've read on the topic so far, it seems that inverted
> indexes do work good on big texts, as they are able to exploit the
> repetition of words to do ranking.

Lucene is no psychic. If you're looking for "dog", and the index
contains two short documents, actually titles:
      "Sparky the Fire Dog"
and   "Dog Hause Home Page"
(just two silly titles from Google's top 10 results for "dog"...)
Then there's hardly any way for Lucene to determine which document
should be ranked higher.
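
To see why the two titles tie, here is a rough sketch in plain
Java (it does not call Lucene). It assumes whitespace tokenization
with no stop-word removal, and uses the square-root term-frequency
and length-normalization factors documented for Lucene's
DefaultSimilarity. The class and method names are made up for
illustration, and the idf factor is left out because it is the
same for both titles.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ShortTitleScore {
    // Term-frequency factor: square root of the raw count.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Length normalization: shorter fields get a higher weight.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Assumption: lowercase and split on whitespace, keep every token.
    static List<String> tokenize(String title) {
        return Arrays.asList(title.toLowerCase().split("\\s+"));
    }

    // Score of a single-term query against one title (idf omitted).
    static double score(String title, String term) {
        List<String> tokens = tokenize(title);
        int freq = Collections.frequency(tokens, term);
        return freq == 0 ? 0.0 : tf(freq) * lengthNorm(tokens.size());
    }

    public static void main(String[] args) {
        System.out.println(score("Sparky the Fire Dog", "dog")); // 0.5
        System.out.println(score("Dog Hause Home Page", "dog")); // 0.5
    }
}
```

Both titles contain "dog" once and are four tokens long, so every
factor is identical and the text alone cannot separate them.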

For single word queries in a situation like this, you might want
to help Lucene learn the "good" ranking. One way is to use
Document.setBoost() (or Field.setBoost()) to predetermine which
document is more "important" regardless of its text (e.g.,
using some sort of link analysis, or whatever trick is
applicable in your situation). Another way is to override
Lucene's relevance ranking with some other type of sorting
(see the Sort class) - for example, to sort all the matching
results by date, to get the newer matching results first.
In many applications, you might want to let your users control
this sort order; for example, in a shopping site (where product
names are the very short "documents"), you might want to let
the user sort the results by price, by popularity, by release
date, by user ratings, and so on.
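
The two options can be pictured with a small plain-Java sketch.
It does not use Lucene's API: the Hit class, the boost values and
the years are all made up, and it assumes the boost simply
multiplies the text score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ShortDocRanking {
    // Hypothetical search hit; not a Lucene class.
    static final class Hit {
        final String title;
        final double textScore; // relevance from the text alone
        final double boost;     // importance chosen at indexing time
        final int year;         // made-up date field to sort by

        Hit(String title, double textScore, double boost, int year) {
            this.title = title;
            this.textScore = textScore;
            this.boost = boost;
            this.year = year;
        }

        // Assumption: the boost acts as a plain multiplier.
        double boostedScore() {
            return textScore * boost;
        }
    }

    // Option 1: keep relevance ranking, but let the boost break ties.
    static List<Hit> byBoostedScore(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble(Hit::boostedScore).reversed());
        return sorted;
    }

    // Option 2: ignore relevance and show the newest matches first.
    static List<Hit> byNewest(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt((Hit h) -> h.year).reversed());
        return sorted;
    }

    // Two hits with equal text scores; boosts and years are invented.
    static List<Hit> sampleHits() {
        return Arrays.asList(
            new Hit("Sparky the Fire Dog", 0.5, 1.0, 2006),
            new Hit("Dog Hause Home Page", 0.5, 2.0, 2004));
    }

    public static void main(String[] args) {
        List<Hit> hits = sampleHits();
        System.out.println(byBoostedScore(hits).get(0).title);
        System.out.println(byNewest(hits).get(0).title);
    }
}
```

With these invented numbers the boost puts "Dog Hause Home Page"
first, while sorting by date puts "Sparky the Fire Dog" first.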

For multi-word queries, it is actually possible to improve
on Lucene's standard ranking. For example, let's say you
have the two titles
      "Hot Dog on a Stick"
      "Your Dog in Hot Weather"
and you get the query "hot dog" (without quotation marks).
Using QueryParser, Lucene will normally rank the two titles
more or less the same. However, the first one is probably
much better, because the words "hot" and "dog" don't just
appear there; they appear very close together, and in this
case even in order.
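
To make the proximity idea concrete, here is a plain-Java sketch
(not Lucene code) that measures how far apart the two query words
are in each title. The names minGap and proximityBonus, and the
1/distance bonus itself, are made up for illustration; tokens are
assumed to be split on whitespace.

```java
import java.util.Arrays;
import java.util.List;

class ProximitySketch {
    // Sentinel returned when one of the two words is absent.
    static final int MISSING = Integer.MAX_VALUE;

    // Smallest gap, by absolute value, between an occurrence of
    // `first` and one of `second`. Positive means `first` comes
    // before `second`; negative means the order is reversed.
    static int minGap(String title, String first, String second) {
        List<String> tokens = Arrays.asList(title.toLowerCase().split("\\s+"));
        int best = MISSING;
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) continue;
            for (int j = 0; j < tokens.size(); j++) {
                if (!tokens.get(j).equals(second)) continue;
                int gap = j - i;
                if (best == MISSING || Math.abs(gap) < Math.abs(best)) {
                    best = gap;
                }
            }
        }
        return best;
    }

    // Hypothetical bonus: adjacent words get 1.0, farther ones less.
    static double proximityBonus(int gap) {
        if (gap == MISSING || gap == 0) return 0.0;
        return 1.0 / Math.abs(gap);
    }

    public static void main(String[] args) {
        int a = minGap("Hot Dog on a Stick", "hot", "dog");
        int b = minGap("Your Dog in Hot Weather", "hot", "dog");
        System.out.println(a + " -> " + proximityBonus(a));
        System.out.println(b + " -> " + proximityBonus(b));
    }
}
```

For the first title the gap is 1 (adjacent and in order); for the
second it is -2 (two positions apart, reversed), so a bonus of
this kind would favor the first title.
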

This sort of proximity-influenced scoring is missing from
Lucene's QueryParser, and I've been wondering recently
how best to add it, and whether it can be done easily
with existing Lucene machinery, like the SpanQuery class.
Has anyone tried something like this before, and can you
tell us about your experience?

Good Luck,
Nadav.

--
Nadav Har'El

