lucene-dev mailing list archives

From "Otis Gospodnetic (JIRA)" <>
Subject [jira] Commented: (LUCENE-1216) CharDelimiterTokenizer
Date Thu, 15 May 2008 16:15:55 GMT


Otis Gospodnetic commented on LUCENE-1216:

Aha, that makes sense - thanks for clarifying.  I think I'm not the only one who won't immediately
realize that setWhitespaceDelimiter delimits on all isWhitespace characters, so it would be
good to add that to the javadoc.

Could you please do that and upload the new class + its unit test class as a patch?
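
For readers wondering what "all isWhitespace characters" covers, here is a minimal sketch. It assumes the check is the JDK's Character.isWhitespace; the class name and the isDelimiter helper are hypothetical and are not taken from the attached patch.

```java
// Hypothetical illustration; not the CharDelimiterTokenizer from the patch.
final class WhitespaceDelimiterCheck {

    /**
     * Returns true when {@code c} would count as a delimiter, ASSUMING the
     * whitespace mode simply delegates to {@link Character#isWhitespace(char)}.
     */
    static boolean isDelimiter(char c) {
        return Character.isWhitespace(c);
    }

    public static void main(String[] args) {
        // Space, tab, newline, ideographic space (U+3000), no-break space (U+00A0).
        char[] samples = {' ', '\t', '\n', (char) 0x3000, (char) 0x00A0};
        for (char c : samples) {
            System.out.printf("U+%04X -> %b%n", (int) c, isDelimiter(c));
        }
    }
}
```

Under that assumption the ideographic space U+3000 is a delimiter, while the no-break space U+00A0 is not, which is the kind of detail the javadoc could spell out.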


> CharDelimiterTokenizer
> ----------------------
>                 Key: LUCENE-1216
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Hiroaki Kawai
>            Assignee: Otis Gospodnetic
>            Priority: Minor
>         Attachments:,,
> WhitespaceTokenizer is very useful for space-separated languages, but my Japanese text
> is not always separated by spaces. So I created an alternative Tokenizer that lets us specify
> the delimiter. The submitted file is intended as an improvement on the current WhitespaceTokenizer.
> I tried to extend it from CharTokenizer, but CharTokenizer has the limitation that a token
> can't be longer than 255 chars.
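
The idea in the description above can be sketched roughly as follows. This is a standalone illustration that uses only the JDK, not the attached class and not the Lucene Tokenizer API; the DelimiterSplitter name and its split method are hypothetical. It accumulates each token in a growable StringBuilder, so no fixed token length applies.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting on a caller-chosen delimiter.
final class DelimiterSplitter {

    /** Splits the reader's content on {@code delimiter}, skipping empty tokens. */
    static List<String> split(Reader in, char delimiter) throws IOException {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder(); // grows as needed, no 255-char cap
        int c;
        while ((c = in.read()) != -1) {
            if (c == delimiter) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append((char) c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the trailing token
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(split(new StringReader("tokyo,osaka,,kyoto"), ','));
    }
}
```

For Japanese text the delimiter could just as well be the ideographic comma or another character of the caller's choosing; the comma here only keeps the example ASCII.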

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.