lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: NGrams and positions
Date Thu, 15 May 2008 20:20:08 GMT
I think the original use-case is in LUCENE-1224 where Hiroaki wrote this:

With current trunk NGramTokenFilter(min=2,max=4) , I index "abcdef"
string into an index, but I can't query it with "abc". If I query with
"ab", I can get a hit result. 

The reason is that the NGramTokenFilter generates badly ordered
TokenStream. Query is based on the Token order in the TokenStream, that
how stemming or phrase should be analyzed is based on the order
(Token.positionIncrement).

With current filter, query string "abc" is tokenized to : ab bc abc 
meaning "query a string that has ab bc abc in this order".
Expected filter will generate : ab abc(positionIncrement=0) bc
meaning "query a string that has (ab|abc) bc in this order"


I did not verify whether what Hiroaki wrote is correct, but I assume it is because he tested
the current behaviour. I find the above a little hard to follow, so I had to write it out for
myself, like this:

Current filter:      ab(0) bc(1) abc(2)
Hiroaki's filter:    ab(0) abc(0) bc(1)

Hiroaki's "abc" query after tokenization: abc -> ab(0) abc(0) bc(1)

If the indexed text is "abcdef" this allows him to search for "abc", which gets translated
into "ab|abc(0) bc(1)", whereas the current code will take "abc" and translate that to "ab(0)
bc(1) abc(2)", and this just won't match, he says.  Now that I wrote this, I'm getting more
confused -- shouldn't either filter produce the same n-grams at index and search time, with
the same positions, thus yielding hits?  Bah, confused...
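
One way to check this is a small simulation. The sketch below is a toy model in Python, not Lucene code: the function names are made up for illustration, the "current" ordering (all n-grams of one size, then the next size) is an assumption inferred from the "ab bc abc" output quoted above, and the phrase matcher only mimics the "(ab|abc) bc" reading, where terms at the same query position are alternatives.

```python
from collections import defaultdict


def ngrams_size_major(text, min_n, max_n):
    """Assumed 'current' order: all n-grams of one size, then the next size.
    Every token advances the position by 1."""
    tokens, pos = [], 0
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            tokens.append((text[start:start + n], pos))
            pos += 1
    return tokens


def ngrams_offset_major(text, min_n, max_n):
    """Proposed order: n-grams grouped by start offset.
    N-grams that share a start offset share a position (increment 0)."""
    tokens = []
    for start in range(len(text) - min_n + 1):
        for n in range(min_n, max_n + 1):
            if start + n <= len(text):
                tokens.append((text[start:start + n], start))
    return tokens


def phrase_matches(index_tokens, query_tokens):
    """Toy phrase match: some shift must align every query position with the
    index, where terms at the same query position count as alternatives."""
    positions = defaultdict(set)
    for term, pos in index_tokens:
        positions[term].add(pos)
    groups = defaultdict(list)
    for term, pos in query_tokens:
        groups[pos].append(term)
    last = max(pos for _, pos in index_tokens)
    for shift in range(last + 1):
        if all(any(qpos + shift in positions.get(term, ()) for term in terms)
               for qpos, terms in groups.items()):
            return True
    return False


for tokenize in (ngrams_size_major, ngrams_offset_major):
    index = tokenize("abcdef", 2, 4)
    for query in ("ab", "abc"):
        hit = phrase_matches(index, tokenize(query, 2, 4))
        print(tokenize.__name__, repr(query), hit)
```

Under these assumptions the n-grams are the same at index and query time, but the positions are not relative to the same thing. With the size-major order, "abcdef" is indexed as ab(0) bc(1) cd(2) de(3) ef(4) abc(5) ..., so "abc" lands at position 5 in the index but at position 2 in the query "ab(0) bc(1) abc(2)", and the phrase cannot line up. With the offset-major order, the position is simply the start offset, so index and query agree.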


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Doug Cutting <cutting@apache.org>
> To: java-dev@lucene.apache.org
> Sent: Thursday, May 15, 2008 12:54:43 PM
> Subject: Re: NGrams and positions
> 
> The conventional use of ngrams when searching is not to treat them as a 
> set but a sequence.  Thus, for "foola" you could index the sequence 
> ["_f", "fo", "oo", "ol", "la", "a_"], and then search for the phrase 
> ["oo", "ol"] to find all occurrences of "ool".  This is useful in 
> languages that use logograms without spaces, like Japanese and Chinese, 
> and in other cases (e.g., Nutch uses word-ngrams to optimize searches 
> for phrases containing very common terms).
> 
> Do you have a use-case for the alternative, where n-grams are treated as 
> a set, rather than a sequence?
> 
> Doug
> 
> Grant Ingersoll wrote:
> > See https://issues.apache.org/jira/browse/LUCENE-1224
> > 
> > Do people have an opinion on what positions ngrams should be output at?  
> > For instance, given 1-grams on "abc fgh", these are currently output as: 
> > a, b, c, f, g, h, all with a position increment of 1.  That seems somewhat 
> > reasonable, but it has tradeoffs, namely you can't query for something 
> > like: "a f" without some amount of slop, which I think is a reasonable 
> > thing to do (but don't have an actual use case for at the moment.)  An 
> > alternative way might be to output a, b, c all at the same position, 
> > then increment for f and then put g and h at the same position.
> > 
> > I am _wondering_ whether it makes more sense to add an option to the 
> > NGram token streams such that we could have the choice of either 
> > outputting the n-grams within a "token" at the same position or at 
> > successive positions (to be back-compatible.)  It isn't clear to me 
> > which is correct, or if there is even a notion of correctness here, in 
> > so much as they are both correct if that is the functionality you want 
> > in your application.  As DM Smith noted, if Lucene supported the notion 
> > of "sub" positions, one could output 1.a, 1.b, 1.c, 2.a, 2.b and 2.c for 
> > the example above, but that capability doesn't exist in Lucene right 
> > now, AFAIK.
> > 
> > -Grant
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> > 