lucene-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: NGrams and positions
Date Fri, 16 May 2008 12:24:49 GMT
I think there is also the use case of using ngrams as a substitute for  
wildcard queries, and you could then imagine doing a phrase query  
across those tokens.  So, if you had the words "quick red fox" and you  
output 1-grams (as an example), you could do a query like "q r f" in  
place of "q* r* f*".
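
A rough sketch of that idea, in plain Python rather than Lucene code. It assumes every 1-gram of a word is indexed at that word's position; the helper names `index_same_position` and `phrase_match` are made up for illustration.

```python
from collections import defaultdict

def index_same_position(text, n=1):
    """Map each n-gram to the set of word positions where it occurs.

    Assumption: all n-grams of a word share that word's position.
    """
    postings = defaultdict(set)
    for pos, word in enumerate(text.split()):
        for i in range(len(word) - n + 1):
            postings[word[i:i + n]].add(pos)
    return postings

def phrase_match(postings, grams):
    """True if the grams occur at consecutive positions, in order."""
    starts = postings.get(grams[0], set())
    return any(
        all(start + k in postings.get(g, set()) for k, g in enumerate(grams))
        for start in starts
    )

postings = index_same_position("quick red fox")
print(phrase_match(postings, ["q", "r", "f"]))  # True
print(phrase_match(postings, ["r", "q", "f"]))  # False
```

Note that with plain 1-grams, "u e o" would match as well, since nothing ties a gram to the start of its word; emitting only leading (edge) grams would be closer to "q* r* f*".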


On May 15, 2008, at 12:54 PM, Doug Cutting wrote:

> The conventional use of ngrams when searching is not to treat them  
> as a set but a sequence.  Thus, for "foola" you could index the  
> sequence ["_f", "fo", "oo", "ol", "la", "a_"], and then search for  
> the phrase ["oo", "ol"] to find all occurrences of "ool".  This is  
> useful in languages that use logograms without spaces, like Japanese  
> and Chinese, and in other cases (e.g., Nutch uses word-ngrams to  
> optimize searches for phrases containing very common terms).
>
> Do you have a use-case for the alternative, where n-grams are  
> treated as a set, rather than a sequence?
>
> Doug
>
> Grant Ingersoll wrote:
>> See https://issues.apache.org/jira/browse/LUCENE-1224
>> Do people have an opinion on what positions ngrams should be output  
>> at?  For instance, given 1-grams on "abc fgh", these are currently  
>> output as: a, b, c, f, g, h, all with a position increment of 1.   
>> That seems somewhat reasonable, but it has tradeoffs, namely you  
>> can't query for something like: "a f" without some amount of slop,  
>> which I think is a reasonable thing to do (but don't have an actual  
>> use case for at the moment.)  An alternative way might be to output  
>> a, b, c all at the same position, then increment for f and then put  
>> g and h at the same position.
>> I am _wondering_ whether it makes more sense to add an option to  
>> the NGram token streams such that we could have the choice of  
>> either outputting the n-grams within a "token" at the same position  
>> or at successive positions (to be back-compatible.)  It isn't clear  
>> to me which is correct, or if there is even a notion of correctness  
>> here, insofar as they are both correct if that is the  
>> functionality you want in your application.  As DM Smith noted, if  
>> Lucene supported the notion of "sub" positions, one could output  
>> 1.a, 1.b, 1.c, 2.a, 2.b and 2.c for the example above, but that  
>> capability doesn't exist in Lucene right now, AFAIK.
>> -Grant
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
>
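
For reference, the sequence view of n-grams from the "foola" example in the quoted message can be sketched as follows. This is plain Python, not Lucene code; the "_" padding follows the quoted example, and `char_ngrams` and `contains_phrase` are hypothetical helper names.

```python
def char_ngrams(word, n=2, pad="_"):
    """Return the character n-grams of word, padded once on each side."""
    padded = pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def contains_phrase(tokens, phrase):
    """True if phrase occurs as a contiguous, ordered run inside tokens."""
    m = len(phrase)
    return any(tokens[i:i + m] == phrase for i in range(len(tokens) - m + 1))

tokens = char_ngrams("foola")
print(tokens)                                 # ['_f', 'fo', 'oo', 'ol', 'la', 'a_']
print(contains_phrase(tokens, ["oo", "ol"]))  # True: "ool" occurs in "foola"
print(contains_phrase(tokens, ["ol", "oo"]))  # False: order matters in a sequence
```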

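The two position schemes from the original question ("abc fgh" as 1-grams) can be compared with a small sketch. Again this is plain Python, not the NGram token stream code; `positions` and `two_term_phrase` are hypothetical names, and the slop check is a simplified in-order approximation, not Lucene's actual sloppy phrase matching.

```python
def positions(text, same_position):
    """Return (gram, position) pairs for the 1-grams of each whitespace token."""
    out = []
    pos = -1
    for word in text.split():
        for i, ch in enumerate(word):
            # Successive scheme: every gram advances the position by 1.
            # Same-position scheme: only the first gram of a word advances it.
            if not same_position or i == 0:
                pos += 1
            out.append((ch, pos))
    return out

def two_term_phrase(pairs, first, second, slop=0):
    """Simplified check: second follows first within 1 + slop positions."""
    firsts = [p for g, p in pairs if g == first]
    seconds = [p for g, p in pairs if g == second]
    return any(0 < s - f <= 1 + slop for f in firsts for s in seconds)

successive = positions("abc fgh", same_position=False)
stacked = positions("abc fgh", same_position=True)

print(successive)  # [('a', 0), ('b', 1), ('c', 2), ('f', 3), ('g', 4), ('h', 5)]
print(stacked)     # [('a', 0), ('b', 0), ('c', 0), ('f', 1), ('g', 1), ('h', 1)]
print(two_term_phrase(successive, "a", "f"))          # False: needs slop
print(two_term_phrase(successive, "a", "f", slop=2))  # True
print(two_term_phrase(stacked, "a", "f"))             # True with no slop
```
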

