From: "Michael McCandless (JIRA)"
To: java-dev@lucene.apache.org
Date: Tue, 16 Jun 2009 09:09:07 -0700 (PDT)
Subject: [jira] Commented: (LUCENE-973) Token of "" returns in CJKTokenizer + new TestCJKTokenizer
Message-ID: <318655247.1245168547360.JavaMail.jira@brutus>
In-Reply-To: <4002653.1186493520291.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720207#action_12720207 ]

Michael McCandless commented on LUCENE-973:
-------------------------------------------

Well, my question is: is there any input text that would cause an arbitrary number of such 0-length tokens in a row?

E.g., the original cause of that was just at the boundary of a two-byte character and a one-byte character... so if that's the only case that hits a 0-length token, then we are OK. But if there are other cases, such that one could chain any number of such tokens in sequence, we're not, and we have to translate recursion into iteration.

> Token of "" returns in CJKTokenizer + new TestCJKTokenizer
> -----------------------------------------------------------
>
>                 Key: LUCENE-973
>                 URL: https://issues.apache.org/jira/browse/LUCENE-973
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.3
>            Reporter: Toru Matsuzawa
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: CJKTokenizer20070807.patch, LUCENE-973.patch, LUCENE-973.patch, with-patch.jpg, without-patch.jpg
>
>
> The "" string is returned as a Token at the boundary of a two-byte character and a one-byte character.
> There is no problem in CJKAnalyzer.
> When CJKTokenizer is used on its own, it becomes a problem. (Use it with
> Solr etc.)
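
For illustration only, here is a minimal Java sketch of the recursion-versus-iteration concern raised in the comment. It is not the LUCENE-973 patch and does not use Lucene's API: the class EmptyTokenSkipper and its methods are hypothetical names, and the raw token stream is modeled as a plain Iterator<String> in which "" stands for a 0-length token.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a tokenizer's "next token" logic.
// Assumption: raw tokens arrive as strings and "" models a 0-length token.
final class EmptyTokenSkipper {
    private EmptyTokenSkipper() {}

    // Recursive form: every skipped empty token adds one stack frame, so an
    // unbounded run of empty tokens could eventually exhaust the stack.
    static String nextRecursive(Iterator<String> raw) {
        if (!raw.hasNext()) {
            return null; // end of stream
        }
        String token = raw.next();
        return token.isEmpty() ? nextRecursive(raw) : token;
    }

    // Iterative form: the same skipping logic as a loop, with constant
    // stack depth no matter how many empty tokens appear in a row.
    static String nextIterative(Iterator<String> raw) {
        while (raw.hasNext()) {
            String token = raw.next();
            if (!token.isEmpty()) {
                return token;
            }
        }
        return null; // end of stream
    }

    // Collects all non-empty tokens using the iterative form.
    static List<String> drain(Iterator<String> raw) {
        List<String> out = new ArrayList<>();
        for (String t = nextIterative(raw); t != null; t = nextIterative(raw)) {
            out.add(t);
        }
        return out;
    }
}
```

If empty tokens can only occur singly (for example, only at a two-byte/one-byte boundary), both forms behave the same and the recursion depth stays bounded. The loop only becomes necessary if some input can produce arbitrarily long runs of empty tokens.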