lucene-java-user mailing list archives

From Sebastian Marius Kirsch
Subject Re: n-gram indexing
Date Sun, 24 Jul 2005 17:45:03 GMT
Hi Rajeev,

I wrote a filter for generating n-grams a while back; I intended to
use it for statistics, but I guess you can also use it for search. I
also thought of the "boosting effect" you describe when I implemented
it, though I never actually tested whether it works that way.

It's in the Lucene bugzilla section:

Let me know if you have any problems with it.

If Lucene's phrase queries honour position increments correctly,
they should work out of the box with my code. (My code adds the
unigram first with a position increment of 1, and then all n-grams
starting with that unigram with an increment of 0.) Just make sure you
use the same Analyzer for the queries as you do for the indexing, so
that n-grams are added to the query as well.
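
To illustrate the position-increment scheme described above, here is a
minimal standalone sketch in plain Java. It is not the filter from the
bugzilla attachment and it does not use the Lucene API. The class
NGramPositionSketch, the Token holder, and the single space used to
join the words of an n-gram are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramPositionSketch {

    /** Hypothetical token holder: term text plus position increment. */
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    /**
     * Emits each unigram with a position increment of 1, followed by
     * every n-gram of 2..maxN words starting at that unigram with a
     * position increment of 0, so they share the unigram's position.
     */
    static List<Token> ngrams(String[] words, int maxN) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(new Token(words[i], 1));
            StringBuilder gram = new StringBuilder(words[i]);
            // Grow the n-gram one word at a time until maxN or end of input.
            for (int n = 2; n <= maxN && i + n <= words.length; n++) {
                gram.append(' ').append(words[i + n - 1]); // separator is an assumption
                out.add(new Token(gram.toString(), 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] words = {"quick", "brown", "fox"};
        for (Token t : ngrams(words, 3)) {
            System.out.println(t);
        }
    }
}
```

For the input "quick brown fox" with maxN = 3, this yields quick(+1),
quick brown(+0), quick brown fox(+0), brown(+1), brown fox(+0),
fox(+1). The unigrams keep consecutive positions, which is what a
phrase query over the plain words would rely on.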

As regards the suggestion to use a very sloppy phrase or span query
instead: I expect the n-gram approach to be faster, as it would only use
TermQuery/BooleanQuery. You basically trade index size for search
speed.

If you get this to work with phrase queries, please add a test case
demonstrating it.

I saw something about n-gram queries in Nutch, as far as I remember;
does anyone know whether Nutch uses n-grams to speed up phrase
queries?

Regards, Sebastian

On Mon, Jul 18, 2005 at 02:27:28PM -0700, Rajesh Munavalli wrote:
> At what point do I add n-grams? Does the order in which I add n-grams
> affect exact phrase queries later? My questions are
> (1) Should I add all the 1-grams followed by 2-grams followed by
> 3-grams..etc sentence by sentence OR
> (2) Add all the 1 grams of entire document first before starting 2-grams
> for the entire document?

