lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rajesh Munavalli" <raje...@dessci.com>
Subject RE: n-gram indexing
Date Mon, 18 Jul 2005 22:06:37 GMT
Intution behind adding n-grams is to boost naturally occurring larger
phrases versus using phrase queries. For example, if I am searching for
"united states of america", I want the search results to return the
documents ordered as follows

Rank 1 - Documents containing all the words occurring together
Rank 2 - Documents containing maximum number of words in the same
sentence
Rank 3 - Documents containing all the words but some might appear in the
same sentence some may not
Rank 4 - Documents containig atleast one or two words

If we have a n-gram index, most probably document talking about "united
states" gets preference over document containing "united" and "states"
seperately. If I am correct, this can be achieved without using phrase
queries. I am not sure if there is a better way to achieve the same
effect.

Thanks,

Rajesh


-----Original Message-----
From: Andy Roberts [mailto:mail@andy-roberts.net] 
Sent: Monday, July 18, 2005 5:56 PM
To: java-user@lucene.apache.org
Subject: Re: n-gram indexing

On Monday 18 Jul 2005 21:27, Rajesh Munavalli wrote:
> At what point do I add n-grams? Does the order in which I add n-grams 
> affect exact phrase queries later? My questions are
>
> (1) Should I add all the 1-grams followed by 2-grams followed by 
> 3-grams..etc sentence by sentence OR
>
> (2) Add all the 1 grams of entire document first before starting 
> 2-grams for the entire document?
>
> What is the general accepted notion of adding n-grams of a document?
>
> thanks,
>
> Rajesh

I can't see any real advantage of storing n-grams explicitly. Just index
the document and use phrase queries. Order is significant with phrase
queries if I recall correctly, although you can use SpanNearQueries to
look for unordered ngrams, although I don't know why you would want to!

Perhaps if you explain a little more about what you are trying to
achieve more generally, we can confirm that you don't need to mess with
explicit indexing of indexing.

Andy

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message