commons-issues mailing list archives

From "Sebastiaan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TEXT-161) Should there be a better implementation of substring that deals with Unicode surrogate pairs correctly?
Date Sat, 13 Apr 2019 10:01:00 GMT

     [ https://issues.apache.org/jira/browse/TEXT-161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastiaan updated TEXT-161:
----------------------------
    Description: 
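There are some major problems with Java's substring implementation: it indexes by char
(UTF-16 code unit) rather than by code point, so it can split a surrogate pair and return
a string containing an unpaired surrogate.
For a brief overview, read this blog post: [https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken/]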

 

I have some demo code showing the issues and a possible solution here:
{code:java}
public class SubstringTest {
    public static void main(String[] args) {

        String stringWithPlus2ByteCodePoints = "👦👩👪👫";

        String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
        String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
        String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);

        System.out.println(stringWithPlus2ByteCodePoints);
        System.out.println("invalid sub: " + substring1);
        System.out.println("invalid sub: " + substring2);
        System.out.println("invalid sub: " + substring3);

        String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
        String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
        String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
        System.out.println("real sub: " + realSub1);
        System.out.println("real sub: " + realSub2);
        System.out.println("real sub: " + realSub3);
    }

    private static String getRealSubstring(String string, int beginIndex, int endIndex) {
        if (string == null)
            throw new IllegalArgumentException("String should not be null");
        int length = string.length();
        if (endIndex < 0 || beginIndex > endIndex || beginIndex > length || endIndex > length)
            throw new IllegalArgumentException("Invalid indices");
        int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
        int realEndIndex = string.offsetByCodePoints(0, endIndex);
        return string.substring(realBeginIndex, realEndIndex);
    }


}{code}
The output is:
{noformat}
👦👩👪👫
invalid sub: ?
invalid sub: 👦
invalid sub: ??
real sub: 👦
real sub: 👦👩
real sub: 👩👪{noformat}
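The "?" characters in the output above are most likely unpaired surrogates: each emoji occupies two chars, so a char-based cut at an odd index leaves half of a pair behind. The following is a minimal sketch that checks this directly instead of relying on how the console renders the result. The class name SurrogateInspection and the helper hasUnpairedSurrogate are hypothetical names chosen for illustration; they are not part of the JDK or of Commons Text. The sample string is built from code points so the source file does not depend on its encoding.

```java
final class SurrogateInspection {

    /** Builds the four-emoji sample string from the issue, one code point at a time. */
    static String sample() {
        return new StringBuilder()
                .appendCodePoint(0x1F466)  // boy
                .appendCodePoint(0x1F469)  // woman
                .appendCodePoint(0x1F46A)  // family
                .appendCodePoint(0x1F46B)  // man and woman holding hands
                .toString();
    }

    /** Returns true if s contains a surrogate char that is not part of a well-formed pair. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++;          // skip the low half of a well-formed pair
                } else {
                    return true;  // high surrogate with no low surrogate after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;      // low surrogate with no high surrogate before it
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println(s.length());                              // 8 (chars)
        System.out.println(s.codePointCount(0, s.length()));         // 4 (code points)
        System.out.println(hasUnpairedSurrogate(s.substring(0, 1))); // true
        System.out.println(hasUnpairedSurrogate(s.substring(0, 2))); // false
        System.out.println(hasUnpairedSurrogate(s.substring(1, 3))); // true
    }
}
```

The three substring calls mirror substring1, substring2 and substring3 from the demo: only the cut at (0, 2) happens to fall on pair boundaries.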
 

The same issues appear in Apache Commons Text's substring method.

Should Apache Commons Text use this code, or something similar, in its substring implementation
rather than the flawed Java substring method? Or should it at least offer an additional utility
method that correctly takes substrings of strings containing Unicode code points that require
surrogate pairs?
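As one possible shape for such a utility method, here is a sketch that treats both indices as code point indices and validates them against the code point count rather than the char length. The class name CodePointSubstring and the method name substringByCodePoints are hypothetical; this is not an existing Commons Text API, and throwing IllegalArgumentException simply follows the demo code above.

```java
final class CodePointSubstring {

    /**
     * Returns the code points of s in the range [beginIndex, endIndex),
     * where both indices count code points, not chars.
     */
    static String substringByCodePoints(String s, int beginIndex, int endIndex) {
        if (s == null) {
            throw new IllegalArgumentException("String should not be null");
        }
        // Validate against the number of code points, not s.length().
        int codePoints = s.codePointCount(0, s.length());
        if (beginIndex < 0 || beginIndex > endIndex || endIndex > codePoints) {
            throw new IllegalArgumentException("Invalid indices");
        }
        // Translate code point indices to char indices.
        int from = s.offsetByCodePoints(0, beginIndex);
        // Continue from 'from' so the prefix is not scanned twice.
        int to = s.offsetByCodePoints(from, endIndex - beginIndex);
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        // Same four emoji as in the demo, built from code points.
        String s = new StringBuilder()
                .appendCodePoint(0x1F466)
                .appendCodePoint(0x1F469)
                .appendCodePoint(0x1F46A)
                .appendCodePoint(0x1F46B)
                .toString();
        System.out.println(substringByCodePoints(s, 0, 1)); // first emoji
        System.out.println(substringByCodePoints(s, 1, 3)); // second and third emoji
    }
}
```

Note that this only keeps surrogate pairs intact. It does not attempt to keep user-perceived characters together (for example, emoji joined by a zero-width joiner or a base letter followed by combining marks), which would need grapheme cluster handling instead.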

 

  was:
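There are some major problems with Java's substring implementation: it indexes by char
(UTF-16 code unit) rather than by code point, so it can split a surrogate pair and return
a string containing an unpaired surrogate.
For a brief overview, read this blog post: [https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken/]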

 

I have some demo code showing the issues and a possible solution here:
{code:java}
public class SubstringTest {
    public static void main(String[] args) {

        String stringWithPlus2ByteCodePoints = "👦👩👪👫";

        String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
        String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
        String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);

        System.out.println(stringWithPlus2ByteCodePoints);
        System.out.println("invalid sub: " + substring1);
        System.out.println("invalid sub: " + substring2);
        System.out.println("invalid sub: " + substring3);

        String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
        String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
        String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
        System.out.println("real sub: " + realSub1);
        System.out.println("real sub: " + realSub2);
        System.out.println("real sub: " + realSub3);
    }

    private static String getRealSubstring(String string, int beginIndex, int endIndex) {
        if (string == null)
            throw new IllegalArgumentException("String should not be null");
        int length = string.length();
        if (endIndex < 0 || beginIndex > endIndex || beginIndex >= length || endIndex >= length)
            throw new IllegalArgumentException("Invalid indices");
        int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
        int realEndIndex = string.offsetByCodePoints(0, endIndex);
        return string.substring(realBeginIndex, realEndIndex);
    }


}{code}
The output is:
{noformat}
👦👩👪👫
invalid sub: ?
invalid sub: 👦
invalid sub: ??
real sub: 👦
real sub: 👦👩
real sub: 👩👪{noformat}
 

The same issues appear in Apache Commons Text's substring method.

Should Apache Commons Text use this code, or something similar, in its substring implementation
rather than the flawed Java substring method? Or should it at least offer an additional utility
method that correctly takes substrings of strings containing Unicode code points that require
surrogate pairs?

 


> Should there be a better implementation of substring that deals with Unicode surrogate pairs correctly?
> -------------------------------------------------------------------------------------------------------
>
>                 Key: TEXT-161
>                 URL: https://issues.apache.org/jira/browse/TEXT-161
>             Project: Commons Text
>          Issue Type: New Feature
>    Affects Versions: 1.6
>         Environment: Any
>            Reporter: Sebastiaan
>            Priority: Minor
>              Labels: features
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
