harmony-commits mailing list archives

From "Anton Ivanov (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HARMONY-688) java.util.regex.Matcher does not support Unicode supplementary characters
Date Wed, 27 Sep 2006 12:50:02 GMT
     [ http://issues.apache.org/jira/browse/HARMONY-688?page=all ]

Anton Ivanov updated HARMONY-688:
---------------------------------

    Attachment: patch_src.txt
                patch_tests.txt

This patch adds support for Unicode supplementary characters to the java.util.regex package.
List of changes:

patch_src.txt

changed files:

-ReluctantQuantifierSet.java
-CompositeQuantifierSet.java
    Small changes to index processing, because a LeafSet can now contain 2 chars for one codepoint.
-EmptySet.java
    Implemented find() and findBack() to override the default implementations,
    so that an empty string is not found in the middle of a surrogate pair.
-Lexer.java
    Removed the Lexer stub methods that wrapped Character's methods; Character's methods are now called directly.
    Added method nextCodePoint() to read a supplementary codepoint from the input string as one codepoint rather than as a pair of chars.
    Added methods to determine whether a given value is in the high surrogate or low surrogate range.

-SequenceSet.java
    Added support for new classes to method first().
-DotQuantifierSet.java
    DotQuantifierSet can now be built over AbstractSets, since DotSet is now a subclass of JointSet.
-DotSet.java
-DotAllSet.java
    The dot construct can now consume one char (a non-supplementary codepoint) or two chars
    (a supplementary codepoint), so these classes are no longer LeafSets and are subclassed
    from JointSet instead. Because of this, both classes have to implement the matches() method.
-CharClass.java
    Added support for splitting a character class into two parts: one with only surrogate codepoints
    and one without surrogate codepoints.
-LeafQuantifierSet.java
-UnifiedQuantifierSet.java
    Small changes to index processing, because a LeafSet can now contain 2 chars for one codepoint.
-SingleDecompositions.java
    Fixed comments.
-RangeSet.java
    Added support for new classes to method first().
-Pattern.java
    Added support for compiling constructs with surrogate codepoints into the corresponding nodes.
-DotAllQuantifierSet.java
    DotAllQuantifierSet can now be built over AbstractSets, since DotAllSet is now a subclass of JointSet.
-PosPlusGroupQuantifierSet.java
    The new classes are subclasses of JointSet, but they are not ordinary JointSets and have no FSet field;
    this case is now handled.
-CharSet.java
    Fixed an issue with a toString() call on a CharSequence object.
    Added support for new classes to method first().
-UCICharSet.java
    Removed unused method getChar().
-DecomposedCharSet.java
    Removed calls to Character's methods via the Lexer stub methods; Character's methods are now called directly.
-UCIRangeSet.java
    Added constructor.
-AltQuantifierSet.java
    Small changes to index processing, because a LeafSet can now contain 2 chars for one codepoint.
-AbstractCharClass.java
    Added support for splitting a character class into two parts: one with only surrogate codepoints
    and one without surrogate codepoints.
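
For illustration only (this is not code from the patch): a minimal Java sketch of the two index-handling ideas described above, using only java.lang.Character. The class and method names (SurrogateSketch, dotWidth, splitsSurrogatePair) are hypothetical and do not appear in the Harmony sources.

```java
final class SurrogateSketch {

    // Number of chars a code-point-aware "dot" would consume at index:
    // 1 for a BMP character, 2 for a supplementary codepoint.
    // Hypothetical helper, not the patch's DotSet.matches().
    static int dotWidth(CharSequence input, int index) {
        int codePoint = Character.codePointAt(input, index);
        return Character.charCount(codePoint);
    }

    // True if index lies between the high and low halves of a surrogate
    // pair, i.e. a position where an empty match should not be reported.
    static boolean splitsSurrogatePair(CharSequence input, int index) {
        return index > 0
                && index < input.length()
                && Character.isHighSurrogate(input.charAt(index - 1))
                && Character.isLowSurrogate(input.charAt(index));
    }

    public static void main(String[] args) {
        String s = "a\uD801\uDC28";
        System.out.println(dotWidth(s, 0));
        System.out.println(dotWidth(s, 1));
        System.out.println(splitsSurrogatePair(s, 2));
    }
}
```

A matcher that advances by dotWidth() instead of by 1, and skips positions where splitsSurrogatePair() is true, never stops inside a pair.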

new files:

-  CompositeRangeSet.java       
     This class is used to split a range that contains surrogate characters into two ranges:
     the first consisting of those surrogate characters and the second consisting of all other
     characters from the parent range.
     It represents the parent range split in this manner.
-  HighSurrogateCharSet.java
     This class represents a high surrogate character.
-  LowHighSurrogateRangeSet.java  
     This class is a range that contains only surrogate characters.
-  LowSurrogateCharSet.java   
     This class represents a low surrogate character.
-  SupplCharSet.java
     Represents a node accepting a single supplementary codepoint.
-  SupplRangeSet.java
     Represents a node accepting a single character from the given char class.
     This character can be supplementary (2 chars needed to represent it) or from the
     Basic Multilingual Plane (1 char needed to represent it).
-  UCISupplCharSet.java  
     Represents a node accepting a single supplementary
     codepoint in a Unicode case-insensitive manner.
-  UCISupplRangeSet.java  
     Represents a node accepting a single character from the given char class
     in a Unicode case-insensitive manner.
     This character can be supplementary (2 chars needed to represent it) or from the
     Basic Multilingual Plane (1 char needed to represent it).
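
To make the range splitting described for CompositeRangeSet concrete, here is a minimal Java sketch. It is an assumption about the general idea, not the patch's implementation: the class RangeSplitSketch and its methods are hypothetical, ranges are plain inclusive int pairs, and the non-surrogate remainder is returned as a list because it can fall on both sides of the surrogate block.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeSplitSketch {

    // Part of the inclusive range [lo, hi] that falls inside the surrogate
    // block (MIN_SURROGATE..MAX_SURROGATE), or null if there is no overlap.
    static int[] surrogatePart(int lo, int hi) {
        int from = Math.max(lo, Character.MIN_SURROGATE);
        int to = Math.min(hi, Character.MAX_SURROGATE);
        return from <= to ? new int[] {from, to} : null;
    }

    // Remaining pieces of [lo, hi] that lie outside the surrogate block:
    // at most one piece below it and one piece above it.
    static List<int[]> nonSurrogateParts(int lo, int hi) {
        List<int[]> parts = new ArrayList<>();
        if (lo < Character.MIN_SURROGATE) {
            parts.add(new int[] {lo, Math.min(hi, Character.MIN_SURROGATE - 1)});
        }
        if (hi > Character.MAX_SURROGATE) {
            parts.add(new int[] {Math.max(lo, Character.MAX_SURROGATE + 1), hi});
        }
        return parts;
    }
}
```

In this sketch, splitting the range 0xD000..0xE000 gives the surrogate part 0xD800..0xDFFF and the remaining pieces 0xD000..0xD7FF and 0xE000..0xE000.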

patch_tests.txt

      Added unit tests covering supplementary characters and surrogate codepoints.
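
As an illustration of the kind of behaviour such tests check (a sketch only, not taken from the attached tests), the following Java snippet assumes a java.util.regex implementation that handles supplementary characters. The class and method names are hypothetical; the input is the surrogate pair from the test case in the original report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SupplementaryRegexDemo {

    // True if the input contains a character matched by \p{javaLowerCase},
    // including one encoded as a surrogate pair.
    static boolean findsLowerCase(String input) {
        return Pattern.compile("\\p{javaLowerCase}").matcher(input).find();
    }

    // Length in chars of the first match of "." in the input, or -1 if
    // there is no match. A supplementary codepoint should give 2.
    static int dotMatchLength(String input) {
        Matcher m = Pattern.compile(".").matcher(input);
        return m.find() ? m.end() - m.start() : -1;
    }

    public static void main(String[] args) {
        String supplementary = "\uD801\uDC28";
        System.out.println(findsLowerCase(supplementary));
        System.out.println(dotMatchLength(supplementary));
    }
}
```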

Thanks,
Anton

> java.util.regex.Matcher does not support Unicode supplementary characters
> -------------------------------------------------------------------------
>
>                 Key: HARMONY-688
>                 URL: http://issues.apache.org/jira/browse/HARMONY-688
>             Project: Harmony
>          Issue Type: Bug
>          Components: Classlib
>            Reporter: Richard Liang
>         Attachments: patch_src.txt, patch_tests.txt
>
>
> Hello Nikolay,
> The following test case passes on the RI but fails on Harmony. Would you please have a look at this issue? Thanks a lot.
>     public void test_matcher() {
>         Pattern p = Pattern.compile("\\p{javaLowerCase}");
>         Matcher matcher = p.matcher("\uD801\uDC28");
>         assertTrue(matcher.find());
>     }
> Best regards,
> Richard

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
