Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <417810281.1223593484302.JavaMail.jira@brutus>
Date: Thu, 9 Oct 2008 16:04:44 -0700 (PDT)
From: "Todd Feak (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1224) NGramTokenFilter creates bad
 TokenStream
In-Reply-To: <768992940.1205314249998.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638423#action_12638423 ] 

Todd Feak commented on LUCENE-1224:
-----------------------------------

This bug caused me *major* headaches trying to figure out why substring matching with an NGramTokenFilter wasn't working for anything other then when setting min and max to the same values. 

The patch seems to fix the issue when applied locally, however it also has a bug in it. It will stop parsing a token stream if a token comes through that is less then the minGramSize, even if there are tokens yet in the stream that are greater then minGramSize.

> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>
>                 Key: LUCENE-1224
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1224
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Hiroaki Kawai
>            Assignee: Grant Ingersoll
>            Priority: Critical
>         Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, NGramTokenFilter.patch
>
>
> With current trunk NGramTokenFilter(min=2,max=4) , I index "abcdef" string into an index, but I can't query it with "abc". If I query with "ab", I can get a hit result.
> The reason is that the NGramTokenFilter generates badly ordered TokenStream. Query is based on the Token order in the TokenStream, that how stemming or phrase should be anlayzed is based on the order (Token.positionIncrement).
> With current filter, query string "abc" is tokenized to : ab bc abc 
> meaning "query a string that has ab bc abc in this order".
> Expected filter will generate : ab abc(positionIncrement=0) bc
> meaning "query a string that has (ab|abc) bc in this order"
> I'd like to submit a patch for this issue. :-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org