lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-1689) supplementary character handling
Date Sat, 13 Jun 2009 17:02:07 GMT


Robert Muir commented on LUCENE-1689:

I forgot to answer your question, Michael:

> it depends upon the knowledge that no surrogate pairs lowercase to BMP codepoints
> Is it invalid to make this assumption? Ie, does the unicode standard not guarantee it?

I do not think the standard guarantees this for all future Unicode versions. In my opinion, we should
exploit things like this if I can show a test case that proves it's true for all code points
in the current version of Unicode :)
And it should be documented that this could possibly change in some future version.
In this example, it's a nice simplification because it guarantees the length (in code units)
will not change!
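
A check along these lines could be sketched as follows. This is not the test case attached to the issue; the class and method names are hypothetical. It walks every code point and records any whose simple lowercase mapping, as returned by Character.toLowerCase(int), occupies a different number of UTF-16 code units than the original. The result only says something about the Unicode version bundled with the JVM that runs it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
class SupplementaryLowercaseCheck {

    /**
     * Returns every code point whose simple lowercase mapping needs a
     * different number of UTF-16 code units than the code point itself.
     * An empty list means lowercasing never changes the length of a
     * char[] on this JVM.
     */
    static List<Integer> findViolations() {
        List<Integer> violations = new ArrayList<Integer>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            int lower = Character.toLowerCase(cp);
            // charCount is 1 for BMP code points and 2 for supplementary ones.
            if (Character.charCount(lower) != Character.charCount(cp)) {
                violations.add(Integer.valueOf(cp));
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        List<Integer> violations = findViolations();
        System.out.println("code points that change length when lowercased: " + violations.size());
    }
}
```

If the list ever comes back non-empty on a newer JVM, the simplification no longer holds and the in-place approach needs to be revisited.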

I think for a next step on this issue I will create and upload a test case showing the issues
and detailing some possible solutions.
For some of them, maybe a javadoc update is the most appropriate, but for others, maybe an
API change is the right way to go.
Then we can figure out what should be done.

> supplementary character handling
> --------------------------------
>                 Key: LUCENE-1689
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: LUCENE-1689_lowercase_example.txt
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they
> don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize()
> use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters
> should be the exception, but still work.
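
The LowercaseFilter point in the description above can be illustrated with a minimal sketch of in-place lowercasing over a char[]. It is not the actual Lucene code or a proposed patch; the class and method names are hypothetical. It keeps a fast path for BMP characters and only decodes a code point when it sees a high surrogate. It assumes that lowercasing never changes the number of code units, which is not guaranteed for future Unicode versions.

```java
// Hypothetical helper, not part of Lucene.
class CodePointLowercase {

    /** Lowercases buffer[offset .. offset+length) in place, one code point at a time. */
    static void toLowerCase(char[] buffer, int offset, int length) {
        final int end = offset + length;
        int i = offset;
        while (i < end) {
            char c = buffer[i];
            if (!Character.isHighSurrogate(c)) {
                // Fast path: an ordinary BMP character (or a stray low surrogate,
                // which Character.toLowerCase leaves unchanged).
                buffer[i] = Character.toLowerCase(c);
                i++;
            } else {
                // Slow path: decode the code point. An unpaired high surrogate
                // decodes to itself and is written back as a single char.
                int cp = Character.codePointAt(buffer, i, end);
                int lower = Character.toLowerCase(cp);
                // ASSUMPTION: lower needs as many code units as cp did.
                // toChars returns the number of chars it wrote (1 or 2).
                i += Character.toChars(lower, buffer, i);
            }
        }
    }
}
```

For example, lowercasing the buffer for "A", U+10400 (DESERET CAPITAL LETTER LONG I) and "B" yields "a", U+10428 and "b", with the buffer length unchanged.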

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
