lucene-dev mailing list archives

From Hiroaki Kawai <>
Subject Re: NGrams and positions
Date Fri, 16 May 2008 15:49:31 GMT
I don't think this is a matter of what n-grams are in general.

NGramTokenFilter is a TokenFilter, and it produces a
TRICKY token stream because the text is processed by more than
one tokenizer.

This discussion is about the mechanism of TokenFilter.

In the current implementation, NGramTokenFilter creates such a
tricky token stream that one might consider it a new variant
of n-gram.

The token stream generated by NGramTokenFilter is
processed not only by an n-gram tokenizer but also by a
mixture of other tokenizers, so the token stream
might not look like a normal n-gram.
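
As a rough illustration of that difference, here is a plain-Java sketch.
It is not Lucene code: the class and method names are made up for this
example, and whitespace splitting is only an assumed stand-in for a
word-level tokenizer such as StandardTokenizer. It compares n-grams taken
over the whole text with n-grams taken over each word separately.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation; it does not use or reproduce Lucene's classes.
class NGramStreamSketch {

    // All character n-grams of length n from text, ordered by start offset.
    static List<String> ngrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    // Hypothetical stand-in for "word tokenizer + n-gram filter":
    // split on whitespace first, then n-gram each word on its own.
    static List<String> tokenizeThenNgrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            out.addAll(ngrams(word, n));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "ab cd";
        // N-grams over the raw text include grams that span the space.
        System.out.println(ngrams(text, 2));
        // N-grams per word never cross a word boundary.
        System.out.println(tokenizeThenNgrams(text, 2));
    }
}
```

For the input "ab cd" with n = 2, the first call yields four grams,
including "b " and " c", while the second yields only "ab" and "cd".
The sketch ignores positions and offsets entirely.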

I think Grant is talking about StandardTokenizer + NGramTokenFilter, 

Grant Ingersoll <> wrote:
> On May 16, 2008, at 11:13 AM, Hiroaki Kawai wrote:
> > I think LUCENE-1224 is more complex than LUCENE-1225.
> >
> > First, I want to solve LUCENE-1225. It might be
> > simpler to understand.
> >
> > For LUCENE-1224, I ran into the same issue. My current
> > understanding is that it comes from a mismatch between TokenFilter
> > and position. I apologize that the patch is confusing. I'm aware
> > that the patch still has another issue.
> The patch itself isn't confusing, IMO (the only issue with the patch  
> is the unit test, but that is for the JIRA discussion).  I think it  
> does what it says it does.  This discussion is more philosophical as  
> to what kinds of things people want to do with ngrams in general.
