lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Updated: (LUCENE-1689) supplementary character handling
Date Mon, 15 Jun 2009 01:14:07 GMT


Robert Muir updated LUCENE-1689:

    Attachment: testCurrentBehavior.txt

This is just a patch with test cases showing the existing behavior.

Perhaps these should be fixed:
Simple/Standard/StopAnalyzer/etc.: delete all supplementary characters completely.
LowerCaseFilter: doesn't lowercase supplementary characters correctly.
WildcardQuery: the ? operator does not work correctly.
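
To illustrate the LowerCaseFilter item, here is a minimal sketch in plain Java (not Lucene code; the class and method names are hypothetical). It assumes the filter lowercases one char at a time, so each half of a surrogate pair passes through unchanged, and contrasts that with a code-point-based loop:

```java
class SupplementaryLowerCaseSketch {
    // Lowercases one UTF-16 code unit at a time. Surrogate halves have no
    // case mapping of their own, so supplementary characters are left as-is.
    static String lowerByChar(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return new String(buf);
    }

    // Lowercases one code point at a time, stepping 1 or 2 chars as needed.
    static String lowerByCodePoint(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+10400 DESERET CAPITAL LETTER LONG I lowercases to U+10428.
        String upper = new String(Character.toChars(0x10400));
        String lower = new String(Character.toChars(0x10428));
        System.out.println(lowerByChar(upper).equals(lower));      // false
        System.out.println(lowerByCodePoint(upper).equals(lower)); // true
    }
}
```

The same stepping by Character.charCount() is what a ? wildcard would need in order to consume one whole character rather than one code unit.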

Perhaps these just need some javadocs:
FuzzyQuery: scoring is strange because it's based on surrogates; leave alone and javadoc it.
LengthFilter: length is calculated in UTF-16 code units; leave alone and javadoc it.
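
The difference a LengthFilter javadoc would need to spell out can be sketched in plain Java (the class and method names are hypothetical, not part of Lucene):

```java
class LengthUnitsSketch {
    // Length as String.length() reports it: UTF-16 code units.
    static int lengthInCodeUnits(String term) {
        return term.length();
    }

    // Length as a user would count it: Unicode code points.
    static int lengthInCodePoints(String term) {
        return term.codePointCount(0, term.length());
    }

    public static void main(String[] args) {
        // "a" followed by U+10400, which needs a surrogate pair in UTF-16.
        String term = "a" + new String(Character.toChars(0x10400));
        System.out.println(lengthInCodeUnits(term));  // 3
        System.out.println(lengthInCodePoints(term)); // 2
    }
}
```

The two counts only diverge when a term contains supplementary characters, which is why documenting the behavior may be enough here.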

... and there's always the option to not change any code, but just javadoc all the behavior
as a "fix", providing stuff in contrib or elsewhere that works correctly.
Let me know what you think.

> supplementary character handling
> --------------------------------
>                 Key: LUCENE-1689
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>         Attachments: LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
> for Java 5. Java 5 is based on Unicode 4, which means variable-width encoding.
> Supplementary character support should be fixed for code that works with char/char[].
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc. should at least be changed so they
> don't actually remove supplementary characters, or modified to look for surrogates and behave correctly.
> LowerCaseFilter should be modified to lowercase supplementary characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize()
> use int.
> In all of these cases code should remain optimized for the BMP case, and supplementary characters
> should be the exception, but still work.
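
One way the int-based isTokenChar() idea could look, as a rough sketch only: this is standalone Java, not the actual CharTokenizer, and the class, the isTokenChar(int) signature, and the tokenize() helper are all assumptions made for illustration. BMP characters take the one-char path; only a valid surrogate pair triggers the extra work.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointTokenizerSketch {
    // Hypothetical int-based counterpart of isTokenChar(char).
    static boolean isTokenChar(int codePoint) {
        return Character.isLetter(codePoint);
    }

    static List<String> tokenize(char[] text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length) {
            char c = text[i];
            int cp = c;     // fast path: a BMP char is its own code point
            int width = 1;
            // Slow path, taken only when a valid surrogate pair is found.
            if (Character.isHighSurrogate(c) && i + 1 < text.length
                    && Character.isLowSurrogate(text[i + 1])) {
                cp = Character.toCodePoint(c, text[i + 1]);
                width = 2;
            }
            if (isTokenChar(cp)) {
                current.append(text, i, width);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += width;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "ab " + new String(Character.toChars(0x10400)) + "c";
        // Prints 2: "ab" and the supplementary letter followed by "c".
        System.out.println(tokenize(input.toCharArray()).size());
    }
}
```

With a char-based Character.isLetter(char) check, each surrogate half would be rejected on its own and the supplementary letter would be dropped from the token stream.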
