Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: neutral (asf.osuosl.org: local policy)
From: Andy Roberts <mail@andy-roberts.net>
To: java-user@lucene.apache.org
Subject: Re: n-gram indexing
Date: Mon, 18 Jul 2005 23:16:13 +0000
User-Agent: KMail/1.8.1
References: <D1EFB337111B674B8F1BE155B01C6DD635576B@franklin.corp.dessci>
In-Reply-To: <D1EFB337111B674B8F1BE155B01C6DD635576B@franklin.corp.dessci>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200507182316.13810.mail@andy-roberts.net>

On Monday 18 Jul 2005 22:06, Rajesh Munavalli wrote:
> The intuition behind adding n-grams is to boost naturally occurring larger
> phrases versus using phrase queries. For example, if I am searching for
> "united states of america", I want the search results to return the
> documents ordered as follows:
>
> Rank 1 - Documents containing all the words occurring together
> Rank 2 - Documents containing the maximum number of words in the same
> sentence
> Rank 3 - Documents containing all the words, though not necessarily in
> the same sentence
> Rank 4 - Documents containing at least one or two of the words
>
> If we have an n-gram index, a document talking about "united states"
> will most probably get preference over a document containing "united"
> and "states" separately. If I am correct, this can be achieved without
> using phrase queries. I am not sure if there is a better way to achieve
> the same effect.
>

I don't think n-grams will help either. You could instead run a set of
individual queries, from strictest to loosest. First, run the phrase query
to find hits with the exact phrase; second, run a SpanNear query to find the
docs with the terms close to each other; third, run a boolean AND query for
all terms; and fourth, run a boolean OR query. It will require a little
extra processing, of course, as you are effectively executing four queries
in one. Naturally, this only has to be done when there is more than one term
in the search query. There is also going to be some duplication of hits, so
you could use a HashMap when iterating over the Hits to ensure you get
unique hits when the results are collated.
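
The collation step could look roughly like the sketch below. It is only an
illustration and makes several assumptions: it uses plain modern Java
collections rather than any Lucene class, the lists of integer doc ids stand
in for whatever you would collect while iterating over the Hits of each of
the four queries, and TieredHitMerger and merge are hypothetical names. It
also swaps the plain HashMap for a LinkedHashMap so that the strictest-first
order is preserved.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: collates per-query hit lists.
class TieredHitMerger {

    /**
     * tiers.get(0) holds the doc ids from the strictest query (exact phrase);
     * the last list holds the doc ids from the loosest one (boolean OR).
     * Returns doc id -> 1-based rank of the first tier that matched it,
     * iterating in the order the documents should be presented.
     */
    static Map<Integer, Integer> merge(List<List<Integer>> tiers) {
        Map<Integer, Integer> rankByDoc = new LinkedHashMap<>();
        for (int tier = 0; tier < tiers.size(); tier++) {
            for (Integer docId : tiers.get(tier)) {
                // putIfAbsent keeps the rank assigned by the strictest tier,
                // so a doc seen again in a looser tier is ignored.
                rankByDoc.putIfAbsent(docId, tier + 1);
            }
        }
        return rankByDoc;
    }

    public static void main(String[] args) {
        // Made-up doc ids, purely for illustration.
        List<List<Integer>> tiers = Arrays.asList(
            Arrays.asList(7),            // phrase query
            Arrays.asList(7, 3),         // SpanNear query
            Arrays.asList(7, 3, 9),      // boolean AND query
            Arrays.asList(7, 3, 9, 4));  // boolean OR query
        System.out.println(merge(tiers)); // prints {7=1, 3=2, 9=3, 4=4}
    }
}
```

Each document ends up listed once, under the rank of the strictest query
that found it, which matches the Rank 1 to Rank 4 ordering asked for above.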

Andy
