lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Maxym Mykhalchuk" <>
Subject Re: Small field indexing and ranking
Date Tue, 11 Apr 2006 08:52:07 GMT
Hi Nadav,

Thanks for suggestions.

As for improving multi-word queries, Doug Cutting recently posted a link to 
his presentation,, 
just scroll down to Nutch N-Grams there, and you'll see the answer.
Basically, "Buffy the Vampire" is indexed as buffy, buffy-the+0, the, 
the-vampire+0, vampire; but that's only for common terms like "the", "www" 
etc. I wander if it's possible to use the same technique to store phrases...


Maxym Mykhalchuk
(+39) 320 8593170
PhD student at University of Trento, ITALY
----- Original Message ----- 
From: "Nadav Har'El" <>
To: <>
Sent: Tuesday, April 11, 2006 10:33 AM
Subject: Re: Small field indexing and ranking

> "Maxym Mykhalchuk" <> wrote on 10/04/2006 09:46:16 PM:
>> Here's the issue: All my "documents" will be having a few (2-3:
>> title, short description) short fields. You see, it's rare that the
>> same word is repeated several times in a title, so will Lucene be
>> able to give me a decent ranking, or will it be able to tell me "oh,
>> yes, this term is in the following 300 titles".
>> On what I've read on the topic so far, it seems that inverted
>> indexes do work good on big texts, as they are able to exploit the
>> repetition of words to do ranking.
> Lucene is no psychic. If you're looking for "dog", and the document
> contains two short documents, actually titles:
>      "Sparky the Fire Dog"
> and   "Dog Hause Home Page"
> (just two silly titles from Google's top 10 results for "dog"...)
> Then there's hardly any way for Lucene to determine which document
> should be ranked higher.
> For single word queries in a situation like this, you might want
> to help Lucene learn the "good" ranking. One way is to use
> Document.setBoost() (or Field.setBoost) to pre-determine which
> document is more "important" regardless of its text (e.g.,
> using some sort of link analysis, or whatever trick that is
> applicable in your situation). Another way is to override
> Lucene's relevance ranking with some other type of sorting
> (see the Sort class) - for example, to sort all the matching
> results by date, to get the newer matching results first.
> In many applications, you might want to let your users control
> this sort order; For example, in a shopping site (where product
> names are the very short "documents"), you might want to let
> the user sort the results by price, by popularity, by release
> date, by users' ranking, and so on.
> For multi-word queries, it is actually possible to improve
> on Lucene's standard ranking. For example, let's say you
> have the two titles
>      "Hot Dog on a Stick"
>      "Your Dog in Hot Weather"
> And get a query "hot dog" (without quotation marks).
> Using QueryParser, Lucene will normally rank the two titles
> more or less the same. However, the first one is probably
> much better because the words "hot" and "dog", don't just
> appear there, they actually appear very close, and in this
> case even in order.
> This sort of proximity-influenced scoring is missing from
> Lucene's QueryParser, and I've been wondering recently
> on how it is best to add it, and whether it is possible to
> easily do it with existing Lucene machinary, like the
> SpanQuery class. Has anyone ever tried to do something
> like this before, and can tell us their experience?
> Good Luck,
> Nadav.
> --
> Nadav Har'El
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail: 

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message