lucene-dev mailing list archives

From Grant Ingersoll <>
Subject Re: NGrams and positions
Date Fri, 16 May 2008 12:26:29 GMT
Note also that I am only proposing to make this an option; I agree this is a
valid, good case.  For completeness, one could also use payloads and some
newfangled query to do the sub-position thing.
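
To make the payload idea concrete, here is a rough sketch in plain Python
(not Lucene code). It assumes every 1-gram of a word shares that word's
position and carries its offset within the word as a one-byte payload.
The names grams_with_subpositions and adjacent_within_word are made up
for illustration, and the "query" is only a stand-in for whatever
payload-aware query would actually be needed.

```python
def grams_with_subpositions(text):
    """Return (gram, position, payload) triples for the 1-grams of each word.

    All grams of a word share the word's position; the payload is a single
    byte holding the gram's offset inside the word (assumes words shorter
    than 256 characters).
    """
    tokens = []
    for position, word in enumerate(text.split()):
        for sub, ch in enumerate(word):
            tokens.append((ch, position, bytes([sub])))
    return tokens


def adjacent_within_word(tokens, first, second):
    """True if `second` directly follows `first` inside the same word."""
    return any(
        g1 == first and g2 == second and p1 == p2 and b2[0] == b1[0] + 1
        for g1, p1, b1 in tokens
        for g2, p2, b2 in tokens
    )


tokens = grams_with_subpositions("abc fgh")
print(tokens[:3])
print(adjacent_within_word(tokens, "a", "b"))
print(adjacent_within_word(tokens, "c", "f"))
```

With this layout, "a" and "f" sit at adjacent positions, while the
payload still records that "a" precedes "b" within the first word.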


On May 15, 2008, at 12:54 PM, Doug Cutting wrote:

> The conventional use of ngrams when searching is not to treat them  
> as a set but a sequence.  Thus, for "foola" you could index the  
> sequence ["_f", "fo", "oo", "ol", "la", "a_"], and then search for  
> the phrase ["oo", "ol"] to find all occurrences of "ool".  This is  
> useful in languages that use logograms without spaces, like Japanese  
> and Chinese, and in other cases (e.g., Nutch uses word-ngrams to  
> optimize searches for phrases containing very common terms).
> Do you have a use-case for the alternative, where n-grams are  
> treated as a set, rather than a sequence?
> Doug
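
The sequence-versus-set distinction above can be sketched in a few lines
of plain Python. This is only an illustration, not Lucene code: the
helpers char_ngrams, build_index and phrase_search are hypothetical
names, "_" is assumed as the boundary pad, and the index is a toy
in-memory map from gram to document positions.

```python
from collections import defaultdict


def char_ngrams(text, n=2, pad="_"):
    """Return the character n-grams of text in order, padded at both ends."""
    padded = pad * (n - 1) + text + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_index(docs, n=2):
    """Map each gram to {doc_id: [positions]}; each gram advances the position by 1."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, gram in enumerate(char_ngrams(text, n)):
            index[gram][doc_id].append(pos)
    return index


def phrase_search(index, grams):
    """Return sorted doc ids in which the grams occur at consecutive positions."""
    hits = []
    for doc_id, starts in index.get(grams[0], {}).items():
        for start in starts:
            if all(start + k in index.get(g, {}).get(doc_id, [])
                   for k, g in enumerate(grams)):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {1: "foola", 2: "loo", 3: "olxoo"}
index = build_index(docs)
print(char_ngrams("foola"))
print(phrase_search(index, ["oo", "ol"]))
```

Document 3 contains both "oo" and "ol", so a set-style match would
return it, but the grams are not adjacent in that order, so the phrase
search finds only document 1.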
> Grant Ingersoll wrote:
>> See
>> Do people have an opinion on what positions ngrams should be output  
>> at?  For instance, given 1-grams on "abc fgh", these are currently  
>> output as: a, b, c, f, g, h, all with a position increment of 1.   
>> That seems somewhat reasonable, but it has tradeoffs, namely you  
>> can't query for something like: "a f" without some amount of slop,  
>> which I think is a reasonable thing to do (but don't have an actual  
>> use case for at the moment.)  An alternative way might be to output  
>> a, b, c all at the same position, then increment for f and then put  
>> g and h at the same position.
>> I am _wondering_ whether it makes more sense to add an option to  
>> the NGram token streams such that we could have the choice of  
>> either outputting the n-grams within a "token" at the same position  
>> or at successive positions (to be back-compatible.)  It isn't clear  
>> to me which is correct, or if there is even a notion of correctness  
>> here, insofar as they are both correct if that is the  
>> functionality you want in your application.  As DM Smith noted, if  
>> Lucene supported the notion of "sub" positions, one could output  
>> 1.a, 1.b, 1.c, 2.a, 2.b and 2.c for the example above, but that  
>> capability doesn't exist in Lucene right now, AFAIK.
>> -Grant
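
The two position schemes discussed above can be compared with a small
plain-Python sketch. It is not the NGram token stream code: the function
names and the same_position flag are hypothetical, positions are assumed
to start at 0, and the slop check is a simplified gap count rather than
Lucene's actual sloppy-phrase scoring.

```python
def unigram_positions(text, same_position=False):
    """Return (gram, position) pairs for the 1-grams of whitespace-separated words.

    same_position=False: every gram gets a position increment of 1.
    same_position=True: the first gram of a word gets increment 1 and the
    remaining grams of that word get increment 0.
    """
    out = []
    pos = -1
    for word in text.split():
        for i, ch in enumerate(word):
            if not same_position or i == 0:
                pos += 1
            out.append((ch, pos))
    return out


def phrase_matches(tokens, first, second, slop=0):
    """True if `second` follows `first` with at most `slop` positions in between."""
    firsts = [p for g, p in tokens if g == first]
    seconds = [p for g, p in tokens if g == second]
    return any(0 <= p2 - p1 - 1 <= slop for p1 in firsts for p2 in seconds)


successive = unigram_positions("abc fgh")
stacked = unigram_positions("abc fgh", same_position=True)
print(successive)
print(stacked)
print(phrase_matches(successive, "a", "f"))
print(phrase_matches(successive, "a", "f", slop=2))
print(phrase_matches(stacked, "a", "f"))
```

With successive positions, "a f" matches only once the slop covers the
two intervening grams; with stacked positions it matches exactly, at the
cost of losing the order of grams within a word.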
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail: