lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <>
Subject [jira] [Commented] (LUCENE-1689) supplementary character handling
Date Wed, 19 Sep 2012 19:22:07 GMT


Steven Rowe commented on LUCENE-1689:

Robert, is there anything left to do here?  I think this issue can be resolved as fixed.
> supplementary character handling
> --------------------------------
>                 Key: LUCENE-1689
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 4.1
>         Attachments: LUCENE-1689_lowercase_example.txt, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, testCurrentBehavior.txt
> for Java 5. Java 5 is based on Unicode 4, which means char is a variable-width (UTF-16) encoding: a supplementary character takes two chars, a surrogate pair.
> Supplementary character support should be fixed for code that works with char/char[].
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc. should at least be changed so they don't actually remove supplementary characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase supplementary characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> In all of these cases the code should remain optimized for the BMP case; supplementary characters should be the exception, but still work.
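
For illustration, here is a minimal sketch of the kind of change the description asks for: lowercasing a char[] term buffer by code point while keeping a cheap path for BMP characters. The class and method names are hypothetical and are not taken from the attached patches. The sketch also assumes that lowercasing never changes how many chars a code point occupies, which is what makes the in-place write safe.

```java
// Hypothetical helper, not the code from LUCENE-1689.patch.
final class SupplementaryLowercase {

    /** Lowercases buffer[0..length) in place, one code point at a time. */
    static void toLowerCase(char[] buffer, int length) {
        int i = 0;
        while (i < length) {
            char c = buffer[i];
            if (!Character.isHighSurrogate(c)) {
                // Fast path: a BMP character (or a stray low surrogate,
                // which toLowerCase returns unchanged).
                buffer[i] = Character.toLowerCase(c);
                i++;
            } else {
                // Slow path: combine the surrogate pair into one int code point.
                // The limit argument keeps codePointAt from reading past the term.
                int cp = Character.codePointAt(buffer, i, length);
                int lower = Character.toLowerCase(cp);
                // toChars writes one or two chars at i and returns how many.
                // Assumption: the lowercase form occupies as many chars as the original.
                i += Character.toChars(lower, buffer, i);
            }
        }
    }

    public static void main(String[] args) {
        // U+10400 (DESERET CAPITAL LETTER LONG I) is the surrogate pair D801 DC00.
        char[] term = "A\uD801\uDC00Z".toCharArray();
        toLowerCase(term, term.length);
        System.out.println(new String(term).equals("a\uD801\uDC28z"));
    }
}
```

A char-based loop would call Character.toLowerCase(char) on each half of the pair separately and leave U+10400 unchanged, because a lone surrogate has no case mapping. The same int-based approach would apply to isTokenChar() and normalize() in CharTokenizer.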
