From: Andy Roberts
To: java-user@lucene.apache.org
Subject: Re: n-gram indexing
Date: Mon, 18 Jul 2005 23:16:13 +0000

On Monday 18 Jul 2005 22:06, Rajesh Munavalli wrote:
> Intuition behind adding n-grams is to boost naturally occurring larger
> phrases versus using phrase queries. For example, if I am searching for
> "united states of america", I want the search results to return the
> documents ordered as follows
>
> Rank 1 - Documents containing all the words occurring together
> Rank 2 - Documents containing maximum number of words in the same
> sentence
> Rank 3 - Documents containing all the words but some might appear in the
> same sentence some may not
> Rank 4 - Documents containing at least one or two words
>
> If we have a n-gram index, most probably document talking about "united
> states" gets preference over document containing "united" and "states"
> separately. If I am correct, this can be achieved without using phrase
> queries. I am not sure if there is a better way to achieve the same
> effect.
>

I don't think ngrams will help either. You could perform a set of
individual queries. Firstly, run the phrase query to find hits with the
exact phrase, then perhaps run a SpanNear query to find the docs with the
terms close to each other. Thirdly, do a boolean AND query for all terms,
and fourthly run a boolean OR query.

It will require a little extra processing of course, as you are
technically executing 4 queries in 1. Naturally, this only has to be done
when there is more than one term in the search query. Also, there is
obviously going to be some duplication of hits across the four result
sets, so you could use a HashMap when iterating over the Hits to ensure
you get unique hits when the queries are collated.

Andy
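
A minimal sketch of the collation idea described above, in plain Java
rather than against the Lucene API. Everything named here (the
TieredSearch class, its methods, the slop parameter, the sample
documents) is hypothetical. Each tier is only an in-memory stand-in for
the corresponding Lucene query (phrase, SpanNear, boolean AND, boolean
OR), and the window check is far cruder than real span matching. It uses
a LinkedHashMap instead of a plain HashMap so that the best tier a
document matched is the one recorded and the tier order is preserved.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TieredSearch {

    // Naive whitespace tokenizer; a real index would use an analyzer.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // Tier 2 stand-in: all terms occur, in any order, inside one window
    // of (number of terms + slop) consecutive tokens.
    static boolean withinWindow(List<String> doc, List<String> terms, int slop) {
        int width = terms.size() + slop;
        for (int i = 0; i < doc.size(); i++) {
            List<String> window = doc.subList(i, Math.min(doc.size(), i + width));
            if (window.containsAll(terms)) {
                return true;
            }
        }
        return false;
    }

    static boolean matches(int tier, List<String> doc, List<String> terms, int slop) {
        switch (tier) {
            case 1:  // exact phrase: terms are consecutive and in order
                return Collections.indexOfSubList(doc, terms) >= 0;
            case 2:  // terms close to each other
                return withinWindow(doc, terms, slop);
            case 3:  // boolean AND: every term occurs somewhere
                return doc.containsAll(terms);
            default: // boolean OR: at least one term occurs
                return terms.stream().anyMatch(doc::contains);
        }
    }

    // Returns document id -> best tier matched, in ranking order.
    static Map<String, Integer> search(Map<String, String> docs, String query, int slop) {
        List<String> terms = tokenize(query);
        Map<String, Integer> ranked = new LinkedHashMap<>();
        // A single-term query does not need the first three tiers.
        int firstTier = terms.size() > 1 ? 1 : 4;
        for (int tier = firstTier; tier <= 4; tier++) {
            for (Map.Entry<String, String> e : docs.entrySet()) {
                if (matches(tier, tokenize(e.getValue()), terms, slop)) {
                    // putIfAbsent drops duplicates from the weaker tiers.
                    ranked.putIfAbsent(e.getKey(), tier);
                }
            }
        }
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1", "the united states of america is large");
        docs.put("d2", "america and the united states of old");
        docs.put("d3", "america was discovered long before anyone spoke of the united states");
        docs.put("d4", "the united kingdom");
        docs.put("d5", "nothing relevant here");
        // prints {d1=1, d2=2, d3=3, d4=4}
        System.out.println(search(docs, "united states of america", 3));
    }
}
```

With real Lucene queries the same loop would run over the hits returned
by each of the four queries in turn, keyed on the document, instead of
rescanning the text.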