lucene-dev mailing list archives

From Doug Cutting
Subject Re: NGrams and positions
Date Thu, 15 May 2008 16:54:43 GMT
The conventional use of n-grams when searching is to treat them not as a 
set but as a sequence.  Thus, for "foola" you could index the sequence 
["_f", "fo", "oo", "ol", "la", "a_"], and then search for the phrase 
["oo", "ol"] to find all occurrences of "ool".  This is useful in 
languages that use logograms without spaces, like Japanese and Chinese, 
and in other cases (e.g., Nutch uses word n-grams to optimize searches 
for phrases containing very common terms).
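
A minimal sketch of the sequence-versus-set distinction, in plain Python 
rather than Lucene code.  The helper names char_bigrams and find_phrase 
are hypothetical, and the "_" padding simply follows the "foola" example 
above; a real phrase query would go through the index's position data 
instead of scanning a list.

```python
def char_bigrams(word, pad="_"):
    """Return the ordered bigram sequence for word, padded at both ends."""
    padded = f"{pad}{word}{pad}"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def find_phrase(sequence, phrase):
    """Return the start positions where phrase occurs as consecutive grams."""
    n = len(phrase)
    return [i for i in range(len(sequence) - n + 1)
            if sequence[i:i + n] == phrase]


grams = char_bigrams("foola")
# grams == ['_f', 'fo', 'oo', 'ol', 'la', 'a_']

hits = find_phrase(grams, ["oo", "ol"])
# hits == [2]: "oo" is immediately followed by "ol", so "ool" occurs

# Treated as a set, "olloo" also contains both "oo" and "ol",
# but the grams are not adjacent, so the phrase search finds nothing.
other = char_bigrams("olloo")
set_match = {"oo", "ol"} <= set(other)              # True (false positive)
phrase_match = find_phrase(other, ["oo", "ol"])     # []
```

The last three lines show why order matters: a set-based match would 
report "olloo" as containing "ool", while the sequence-based match does not.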

Do you have a use-case for the alternative, where n-grams are treated as 
a set, rather than a sequence?


Grant Ingersoll wrote:
> See
> Do people have an opinion on what positions ngrams should be output at?  
> For instance, given 1-grams on "abc fgh", these are currently output as: 
> a, b, c, f, g, h, all with a position increment of 1.  That seems somewhat 
> reasonable, but it has tradeoffs, namely you can't query for something 
> like: "a f" without some amount of slop, which I think is a reasonable 
> thing to do (but don't have an actual use case for at the moment.)  An 
> alternative way might be to output a, b, c all at the same position, 
> then increment for f and then put g and h at the same position.
> I am _wondering_ whether it makes more sense to add an option to the 
> NGram token streams such that we could have the choice of either 
> outputting the n-grams within a "token" at the same position or at 
> successive positions (to be back-compatible.)  It isn't clear to me 
> which is correct, or if there is even a notion of correctness here, 
> insofar as they are both correct if that is the functionality you want 
> in your application.  As DM Smith noted, if Lucene supported the notion 
> of "sub" positions, one could output 1.a, 1.b, 1.c, 2.a, 2.b and 2.c for 
> the example above, but that capability doesn't exist in Lucene right 
> now, AFAIK.
> -Grant
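
The two position schemes discussed in the quoted message can be compared 
with a small sketch.  This is plain Python, not the NGram token stream 
code; unigram_positions, its same_position flag, and exact_phrase_match 
are hypothetical names, positions are zero-based for convenience, and 
the phrase check models a two-term phrase query with no slop.

```python
def unigram_positions(text, same_position=False):
    """Assign a position to each 1-gram of a whitespace-separated text.

    same_position=False: every gram gets a position increment of 1.
    same_position=True:  grams from the same word share one position
                         (increment 1 for the first gram, 0 for the rest).
    """
    tokens = []
    position = -1
    for word in text.split():
        for i, ch in enumerate(word):
            increment = 0 if (same_position and i > 0) else 1
            position += increment
            tokens.append((ch, position))
    return tokens


def exact_phrase_match(tokens, first, second):
    """True if `second` occurs exactly one position after `first`."""
    firsts = {p for t, p in tokens if t == first}
    seconds = {p for t, p in tokens if t == second}
    return any(p + 1 in seconds for p in firsts)


successive = unigram_positions("abc fgh")
# [('a', 0), ('b', 1), ('c', 2), ('f', 3), ('g', 4), ('h', 5)]

grouped = unigram_positions("abc fgh", same_position=True)
# [('a', 0), ('b', 0), ('c', 0), ('f', 1), ('g', 1), ('h', 1)]

exact_phrase_match(successive, "a", "f")   # False: needs slop
exact_phrase_match(grouped, "a", "f")      # True
exact_phrase_match(successive, "a", "b")   # True
exact_phrase_match(grouped, "a", "b")      # False: order within a word is lost
```

Under these assumptions, each scheme supports a query the other cannot 
answer without slop, which is the tradeoff raised above.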