From: "Todd Feak (JIRA)"
To: java-dev@lucene.apache.org
Date: Thu, 9 Oct 2008 16:04:44 -0700 (PDT)
Subject: [jira] Commented: (LUCENE-1224) NGramTokenFilter creates bad TokenStream
Message-ID: <417810281.1223593484302.JavaMail.jira@brutus>
In-Reply-To: <768992940.1205314249998.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638423#action_12638423 ]

Todd Feak commented on LUCENE-1224:
-----------------------------------

This bug caused me *major* headaches while I was trying to figure out why substring matching with an NGramTokenFilter only worked when min and max were set to the same value.

The patch seems to fix the issue when applied locally, but it has a bug of its own: it stops parsing a token stream as soon as a token shorter than minGramSize comes through, even if tokens longer than minGramSize remain in the stream.

> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>
>                 Key: LUCENE-1224
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1224
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Hiroaki Kawai
>            Assignee: Grant Ingersoll
>            Priority: Critical
>         Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, NGramTokenFilter.patch
>
>
> With current trunk NGramTokenFilter(min=2,max=4), I index the string "abcdef" into an index, but I can't query it with "abc". If I query with "ab", I get a hit.
> The reason is that NGramTokenFilter generates a badly ordered TokenStream. Queries depend on the Token order in the TokenStream: how stemming or phrases are analyzed is based on that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to : ab bc abc
> meaning "query a string that has ab bc abc in this order".
> The expected filter would generate : ab abc(positionIncrement=0) bc
> meaning "query a string that has (ab|abc) bc in this order"
> I'd like to submit a patch for this issue. :-)
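
For reference, the ordering the issue asks for, together with the short-token behaviour the comment above expects, can be sketched in plain Java. This is only an illustration under stated assumptions, not the attached patch and not Lucene's API: the class `NGramOrderSketch`, the `Gram` holder, and the `ngrams` method are hypothetical names, and a list of strings stands in for a real TokenStream. The sketch also does not try to adjust position increments for skipped tokens, which a real filter may need to do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from Lucene or from the patch.
class NGramOrderSketch {

    /** Stand-in for a Lucene Token: gram text plus its position increment. */
    static final class Gram {
        final String text;
        final int positionIncrement;

        Gram(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    /**
     * Emits grams grouped by start offset. The shortest gram at each offset
     * gets positionIncrement=1; longer grams at the same offset get 0, so
     * they share a position with it.
     */
    static List<Gram> ngrams(List<String> tokens, int minGram, int maxGram) {
        List<Gram> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() < minGram) {
                // Skip a too-short token but keep consuming the stream,
                // rather than ending the stream at this point.
                continue;
            }
            for (int start = 0; start + minGram <= token.length(); start++) {
                boolean firstAtThisOffset = true;
                for (int len = minGram;
                     len <= maxGram && start + len <= token.length();
                     len++) {
                    String gram = token.substring(start, start + len);
                    out.add(new Gram(gram, firstAtThisOffset ? 1 : 0));
                    firstAtThisOffset = false;
                }
            }
        }
        return out;
    }
}
```

Calling `NGramOrderSketch.ngrams(Arrays.asList("abc"), 2, 4)` yields `ab(+1), abc(+0), bc(+1)`, which matches the "ab abc(positionIncrement=0) bc" ordering in the issue description. With input tokens "a" and "abcd" and min=max=2, the one-character token is skipped and "abcd" still produces ab, bc, cd.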