lucene-dev mailing list archives

From "Grant Ingersoll (JIRA)" <>
Subject [jira] Commented: (LUCENE-1224) NGramTokenFilter creates bad TokenStream
Date Wed, 14 May 2008 11:25:55 GMT


Grant Ingersoll commented on LUCENE-1224:

Hi Hiroaki,

I have been reviewing the tests for this and have a couple of comments.  First, I don't see
why you need to bring indexing into the equation.  Second, the changes to testNGrams still
don't test the issue: they don't check that the output ngrams are actually in the correct
positions.  I think you deduce this later in testIndexAndQuery, but it is never explicitly
asserted.  I'd drop testIndexAndQuery and just fix testNGrams so that it checks the positions.
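
As a rough illustration of what "checking the positions" could look like, here is a minimal,
Lucene-free sketch.  The class name PositionCheckSketch and the helper absolutePositions are
hypothetical and are not part of the patch; the sketch only assumes the usual convention that
a token's position is the running sum of position increments, with the first token (increment
1) landing at position 0.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of the patch: turns increments into absolute positions. */
final class PositionCheckSketch {

    /** Accumulates position increments, starting from -1 so the first token lands at 0. */
    static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        int pos = -1;
        for (int i = 0; i < increments.length; i++) {
            pos += increments[i];
            positions[i] = pos;
        }
        return positions;
    }

    public static void main(String[] args) {
        // Terms and increments as they would be collected from the filter under test.
        String[] terms = {"ab", "abc", "bc"};
        int[] increments = {1, 0, 1};
        int[] expected = {0, 0, 1};

        int[] actual = absolutePositions(increments);
        if (!Arrays.equals(expected, actual)) {
            throw new AssertionError("positions were " + Arrays.toString(actual));
        }
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " @ " + actual[i]);
        }
    }
}
```

In a real test the terms and increments would be read from the TokenStream rather than
hard-coded, so the assertion fails if the filter emits the ngrams at the wrong positions.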

On a more philosophical level, it is a bit curious to me that, given the string "abcde fghi",
we are fine with "b" being at position 1, and not at position 0, but "ab" needs to be at
position 0.  I wonder if there are any thoughts on what the relative positions of ngrams
should be.  Should they all occur at the same position?  It seems to me that it doesn't make
sense that the "f" ngrams don't start until some position other than 1.  This would currently
prevent doing phrase queries such as "ab fg" with no slop.
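
To make the phrase-query concern concrete, here is a small standalone sketch.  It is not the
filter's code: it assumes whitespace splitting, min=1 and max=2, and a scheme in which every
new start offset advances the position by one and the count simply continues into the next
word.  The class PhraseGapSketch and its method are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of per-offset ngram positions across two words. */
final class PhraseGapSketch {

    /** Maps each ngram to the first position at which it occurs under the assumed scheme. */
    static Map<String, Integer> positions(String text, int minGram, int maxGram) {
        Map<String, Integer> firstPosition = new LinkedHashMap<>();
        int pos = -1;
        for (String word : text.trim().split("\\s+")) {
            for (int start = 0; start < word.length(); start++) {
                boolean first = true;
                for (int len = minGram; len <= maxGram && start + len <= word.length(); len++) {
                    if (first) {
                        pos++; // only the first ngram at a start offset advances the position
                        first = false;
                    }
                    firstPosition.putIfAbsent(word.substring(start, start + len), pos);
                }
            }
        }
        return firstPosition;
    }

    public static void main(String[] args) {
        Map<String, Integer> p = positions("abcde fghi", 1, 2);
        System.out.println("ab=" + p.get("ab") + " b=" + p.get("b") + " fg=" + p.get("fg"));
        // An exact phrase "ab fg" needs fg at position(ab) + 1.
        System.out.println("gap=" + (p.get("fg") - p.get("ab")));
    }
}
```

Under these assumptions "ab" sits at position 0 and "fg" at position 5, so an exact phrase
match on "ab fg" (which needs a gap of 1) cannot succeed without slop.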

I'm assuming this applies to LUCENE-1225 as well.

I will link 1225 to this issue, and you can attach a single patch.

> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>                 Key: LUCENE-1224
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Hiroaki Kawai
>            Assignee: Grant Ingersoll
>            Priority: Critical
>         Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, NGramTokenFilter.patch
> With the current trunk NGramTokenFilter(min=2,max=4), I index the string "abcdef" into an
> index, but I can't query it with "abc". If I query with "ab", I get a hit.
> The reason is that the NGramTokenFilter generates a badly ordered TokenStream. Queries
> depend on the Token order in the TokenStream: how stemming or phrases should be analyzed
> is based on that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to: ab bc abc
> meaning "query a string that has ab bc abc in this order".
> The expected filter would generate: ab abc(positionIncrement=0) bc
> meaning "query a string that has (ab|abc) bc in this order".
> I'd like to submit a patch for this issue. :-)
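
The ordering described above can be sketched without Lucene.  The following is an
illustrative model only, not the attached patch: NGramOrderSketch and Gram are hypothetical
names, and Gram is a stand-in for Lucene's Token that carries just the term text and its
position increment.  Ngrams that share a start offset are emitted together, and all but the
first get positionIncrement=0.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the expected ngram order; not the NGramTokenFilter itself. */
final class NGramOrderSketch {

    /** Minimal stand-in for a token: term text plus position increment. */
    static final class Gram {
        final String text;
        final int positionIncrement;

        Gram(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    /** Emits ngrams grouped by start offset; later ngrams in a group stack on the first. */
    static List<Gram> grams(String input, int minGram, int maxGram) {
        List<Gram> out = new ArrayList<>();
        for (int start = 0; start < input.length(); start++) {
            boolean first = true;
            for (int len = minGram; len <= maxGram && start + len <= input.length(); len++) {
                out.add(new Gram(input.substring(start, start + len), first ? 1 : 0));
                first = false;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [ab(+1), abc(+0), bc(+1)]
        System.out.println(grams("abc", 2, 4));
    }
}
```

For "abc" with min=2 and max=4 this yields ab, then abc with increment 0, then bc, which is
the "(ab|abc) bc" reading given in the description.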

