lucene-dev mailing list archives

From Grant Ingersoll
Subject NGrams and positions
Date Thu, 15 May 2008 12:47:46 GMT

Do people have an opinion on what positions n-grams should be output
at?  For instance, given 1-grams on "abc fgh", the tokens are currently
output as a, b, c, f, g, h, each with a position increment of 1.  That
seems somewhat reasonable, but it has tradeoffs: namely, you can't
run a phrase query for something like "a f" without some amount of
slop.  I think that is a reasonable query to want (though I don't have
an actual use case for it at the moment).  An alternative might be to
output a, b, c all at the same position, then increment for f and put
g and h at the same position as f.
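
To make the tradeoff concrete, here is a minimal sketch of the two
position schemes. It is a plain Python model, not Lucene's token stream
API; the names ngram_positions and exact_phrase_matches are made up for
illustration, positions are zero-based, and "exact phrase" is modeled
simply as terms found at consecutive positions (slop 0).

```python
def ngram_positions(text, n=1, same_position=False):
    """Return (gram, position) pairs for the character n-grams of each
    whitespace-separated word in text (hypothetical helper)."""
    out = []
    pos = -1  # the first emitted gram lands on position 0
    for word in text.split():
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        for k, gram in enumerate(grams):
            if same_position:
                # Increment once per word; later grams of the word stack on it.
                increment = 1 if k == 0 else 0
            else:
                # Current behaviour described above: every gram advances by 1.
                increment = 1
            pos += increment
            out.append((gram, pos))
    return out


def exact_phrase_matches(tokens, phrase):
    """True if the phrase terms occur at consecutive positions (no slop)."""
    positions = {}
    for term, pos in tokens:
        positions.setdefault(term, set()).add(pos)
    return any(
        all(start + i in positions.get(term, set())
            for i, term in enumerate(phrase))
        for start in positions.get(phrase[0], set())
    )


successive = ngram_positions("abc fgh")
stacked = ngram_positions("abc fgh", same_position=True)

print(successive)  # a..h at positions 0..5
print(stacked)     # a, b, c at position 0; f, g, h at position 1
print(exact_phrase_matches(successive, ["a", "f"]))  # False
print(exact_phrase_matches(stacked, ["a", "f"]))     # True
```

In this model the stacked scheme gives up something in return: "a b"
stops matching as an exact phrase, because a and b now share a position.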

I am _wondering_ whether it makes more sense to add an option to the
NGram token streams so that we could choose between outputting the
n-grams within a "token" at the same position or at successive
positions (the latter to stay back-compatible).  It isn't clear to me
which is correct, or whether there is even a notion of correctness
here, insofar as both are correct if that is the functionality you
want in your application.  As DM Smith noted, if Lucene supported the
notion of "sub" positions, one could output 1.a, 1.b, 1.c, 2.a, 2.b
and 2.c for the example above, but that capability doesn't exist in
Lucene right now, AFAIK.
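
As a rough illustration of the "sub" position idea, the sketch below
tags each gram with a (position, sub-position) pair and renders it in
the 1.a, 1.b style used above. This is purely hypothetical: it models
nothing that exists in Lucene, and sub_positions and label are invented
names. The point is only that both flat schemes can be derived from
such pairs.

```python
from string import ascii_lowercase


def sub_positions(text, n=1):
    """Return (gram, word_no, sub_no) triples: word_no is the 1-based
    position of the source word, sub_no the 0-based offset of the gram
    inside that word (hypothetical model)."""
    out = []
    for word_no, word in enumerate(text.split(), start=1):
        for sub_no in range(len(word) - n + 1):
            out.append((word[sub_no:sub_no + n], word_no, sub_no))
    return out


def label(word_no, sub_no):
    """Render a pair as e.g. '1.a'; only handles up to 26 grams per word."""
    return f"{word_no}.{ascii_lowercase[sub_no]}"


triples = sub_positions("abc fgh")
labels = [label(w, s) for _, w, s in triples]
print(labels)

# Both flat schemes fall out of the pairs:
same_position = [(g, w - 1) for g, w, _ in triples]      # drop the sub part
successive = [(g, i) for i, (g, _, _) in enumerate(triples)]  # flatten in order
```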

