lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: NGrams and positions
Date Thu, 15 May 2008 16:54:43 GMT
The conventional use of n-grams when searching is to treat them as a 
sequence, not a set.  Thus, for "foola" you could index the sequence 
["_f", "fo", "oo", "ol", "la", "a_"], and then search for the phrase 
["oo", "ol"] to find all occurrences of "ool".  This is useful in 
languages that use logograms without spaces, like Japanese and Chinese, 
and in other cases (e.g., Nutch uses word n-grams to optimize searches 
for phrases containing very common terms).
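
A rough Python sketch of the difference, not Lucene code: the helper 
names char_bigrams and phrase_positions are made up for illustration, 
and a plain list stands in for indexed token positions.

```python
def char_bigrams(word, pad="_"):
    """Return the word's character bigrams in order, with boundary padding."""
    padded = pad + word + pad
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def phrase_positions(tokens, phrase):
    """Return every position at which `phrase` occurs as consecutive tokens."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


tokens = char_bigrams("foola")
print(tokens)                                  # ['_f', 'fo', 'oo', 'ol', 'la', 'a_']
print(phrase_positions(tokens, ["oo", "ol"]))  # [2]

# "oloo" contains both bigrams but not the substring "ool":
other = char_bigrams("oloo")
print({"oo", "ol"} <= set(other))              # True  (set match: false positive)
print(phrase_positions(other, ["oo", "ol"]))   # []    (sequence match: no hit)
```

Treated as a set, the two bigrams also match "oloo"; treated as a 
sequence, they only match where "ool" actually occurs.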

Do you have a use-case for the alternative, where n-grams are treated as 
a set, rather than a sequence?

Doug

Grant Ingersoll wrote:
> See https://issues.apache.org/jira/browse/LUCENE-1224
> 
> Do people have an opinion on what positions ngrams should be output at?  
> For instance, given 1-grams on "abc fgh", these are currently output as: 
> a, b, c, f, g, h, all with a position increment of 1.  That seems somewhat 
> reasonable, but it has tradeoffs, namely you can't query for something 
> like: "a f" without some amount of slop, which I think is a reasonable 
> thing to do (but don't have an actual use case for at the moment.)  An 
> alternative way might be to output a, b, c all at the same position, 
> then increment for f and then put g and h at the same position.
> 
> I am _wondering_ whether it makes more sense to add an option to the 
> NGram token streams such that we could have the choice of either 
> outputting the n-grams within a "token" at the same position or at 
> successive positions (to be back-compatible.)  It isn't clear to me 
> which is correct, or if there is even a notion of correctness here, 
> insofar as they are both correct if that is the functionality you want 
> in your application.  As DM Smith noted, if Lucene supported the notion 
> of "sub" positions, one could output 1.a, 1.b, 1.c, 2.a, 2.b and 2.c for 
> the example above, but that capability doesn't exist in Lucene right 
> now, AFAIK.
> 
> -Grant
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
> 
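
The two positioning schemes Grant describes can be sketched in plain 
Python under some assumptions: this does not use the Lucene API, the 
names unigram_positions and adjacent are hypothetical, and "adjacent" 
here just means "exactly one position later", as an exact phrase match 
with no slop would require.

```python
def unigram_positions(text, same_position_per_word=False):
    """Assign a position to each 1-gram of each whitespace-separated word."""
    out = []
    position = -1
    for word in text.split():
        for i, ch in enumerate(word):
            # In the per-word scheme, only the first gram of a word
            # advances the position; the rest share it.
            if not same_position_per_word or i == 0:
                position += 1
            out.append((ch, position))
    return out


def adjacent(postings, first, second):
    """True if `second` occurs exactly one position after `first`."""
    first_pos = {p for t, p in postings if t == first}
    return any(t == second and p - 1 in first_pos for t, p in postings)


successive = unigram_positions("abc fgh")
per_word = unigram_positions("abc fgh", same_position_per_word=True)

print(successive)  # [('a', 0), ('b', 1), ('c', 2), ('f', 3), ('g', 4), ('h', 5)]
print(per_word)    # [('a', 0), ('b', 0), ('c', 0), ('f', 1), ('g', 1), ('h', 1)]

print(adjacent(successive, "a", "f"))  # False: "a f" would need slop
print(adjacent(per_word, "a", "f"))    # True
print(adjacent(successive, "a", "b"))  # True
print(adjacent(per_word, "a", "b"))    # False: order within a word is lost
```

The last two lines show the tradeoff: sharing a position within a word 
makes "a f" an exact match, but gives up the ordering of grams inside 
the word.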

