lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-973) Token of "" returns in CJKTokenizer + new TestCJKTokenizer
Date Tue, 16 Jun 2009 16:09:07 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720207#action_12720207 ]

Michael McCandless commented on LUCENE-973:
-------------------------------------------

Well, my question is: is there any input text that would produce an arbitrary number of such
0-length tokens in a row?

E.g., the original cause was just the boundary between a two-byte character and a one-byte
character... so if that's the only case that produces a 0-length token, then we are OK.  But if
there are other cases, such that one could chain any number of such tokens in sequence, we're
not OK (the recursion could overflow the stack), and we have to translate the recursion into
iteration.
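
A minimal sketch of the recursion-versus-iteration concern, for illustration only. It is not
the actual CJKTokenizer code: the class name SkipEmptyTokensSketch, both method names, the use
of plain strings in place of Token objects, and null as the end-of-stream signal are all
assumptions made for this example.

```java
import java.util.Iterator;

// Hypothetical sketch; not the actual CJKTokenizer code.
class SkipEmptyTokensSketch {

    // Recursive form: each skipped empty token adds one stack frame,
    // so a long enough run of empty tokens can raise StackOverflowError.
    static String nextRecursive(Iterator<String> raw) {
        if (!raw.hasNext()) {
            return null; // end of stream
        }
        String token = raw.next();
        return token.isEmpty() ? nextRecursive(raw) : token;
    }

    // Iterative form: same result, constant stack depth.
    static String nextIterative(Iterator<String> raw) {
        while (raw.hasNext()) {
            String token = raw.next();
            if (!token.isEmpty()) {
                return token;
            }
        }
        return null; // end of stream
    }
}
```

If at most one empty token can occur in a row, the two forms behave the same and the recursive
one is harmless. The iterative form only matters if some input can chain many empty tokens,
since its stack depth does not grow with the length of that chain.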


> Token of  "" returns in CJKTokenizer + new TestCJKTokenizer
> -----------------------------------------------------------
>
>                 Key: LUCENE-973
>                 URL: https://issues.apache.org/jira/browse/LUCENE-973
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.3
>            Reporter: Toru Matsuzawa
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: CJKTokenizer20070807.patch, LUCENE-973.patch, LUCENE-973.patch, with-patch.jpg, without-patch.jpg
>
>
> An empty string ("") is returned as a Token at the boundary between a two-byte character and a one-byte character.
> This is not a problem in CJKAnalyzer.
> It becomes a problem when CJKTokenizer is used on its own (for example, with Solr).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

