commons-issues mailing list archives

From "Michael Houston (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LANG-862) CharSequenceTranslator causes StringIndexOutOfBoundsException during translation of unicode codepoints with length > 1 character
Date Mon, 10 Dec 2012 12:13:21 GMT

    [ https://issues.apache.org/jira/browse/LANG-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527896#comment-13527896 ]

Michael Houston commented on LANG-862:
--------------------------------------

Apologies, I see this is fixed in the latest SVN - I should have browsed the source code first!
                
> CharSequenceTranslator causes StringIndexOutOfBoundsException during translation of unicode codepoints with length > 1 character
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LANG-862
>                 URL: https://issues.apache.org/jira/browse/LANG-862
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.text.translate.*
>    Affects Versions: 3.1
>         Environment: OS X, Java 1.6
>            Reporter: Michael Houston
>              Labels: bug, text, unicode
>
> When translating a string containing Unicode characters, I encountered an index exception:
> {code}
> 	java.lang.StringIndexOutOfBoundsException: String index out of range: 50
> 	at java.lang.String.charAt(String.java:686)
> 	at java.lang.Character.codePointAt(Character.java:2335)
> 	at org.apache.commons.lang3.text.translate.CharSequenceTranslator.translate(CharSequenceTranslator.java:95)
> 	at org.apache.commons.lang3.text.translate.CharSequenceTranslator.translate(CharSequenceTranslator.java:59)
> 	at org.apache.commons.lang3.StringEscapeUtils.escapeCsv(StringEscapeUtils.java:556)
> 	...
> {code}
> The input string was from a Twitter status:
> org.apache.commons.lang3.StringEscapeUtils.escapeCsv("pink & black adidas suit for this rainy weather \ud83d\udc4d");
> Both those characters are 'Invalid' Unicode characters, so presumably there is a conversion error somewhere. However, this shouldn't cause the translator to crash.
> At line 94, the loop that generates the exception increments the position by the size of the codepoint, which seems to grow faster than the number of characters. I don't really know how codepoints work, but it looks to me like there are two indexes which this loop treats as if they were the same one:
>  * pt is incrementing by one character each iteration
>  * pos is incrementing by one or more characters each iteration
>  * pos is being used to index into the character array
>  * pt is the value actually being tested in the loop test, so pos can be bigger than pt, causing an index problem at the end of the array
> My guess would be that the loop should read something like:
> {code}
>             for (int pt = 0; pt < consumed;) {
>                 int count = Character.charCount(Character.codePointAt(input, pos));
>                 pt += count;
>                 pos += count;
>             }
> {code}
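The suggested loop can be exercised on its own. The following is only a minimal sketch, assuming that consumed is a count of chars (UTF-16 code units) rather than code points; the class AdvanceSketch and the method advance are hypothetical names and are not part of Commons Lang.

```java
// Hypothetical sketch; not the Commons Lang source.
final class AdvanceSketch {

    /**
     * Advances pos past at least `consumed` chars of input, one whole code
     * point at a time, so a surrogate pair is never split.
     * Assumption: `consumed` is a count of chars, not of code points.
     */
    static int advance(CharSequence input, int pos, int consumed) {
        for (int pt = 0; pt < consumed;) {
            int count = Character.charCount(Character.codePointAt(input, pos));
            pt += count;   // chars accounted for so far
            pos += count;  // matching index into input
        }
        return pos;
    }

    public static void main(String[] args) {
        String input = " \ud83d\udc4d"; // one space plus one surrogate pair: 3 chars
        System.out.println(advance(input, 0, input.length())); // prints 3
    }
}
```

Because pt and pos grow by the same amount on every iteration, pos cannot run ahead of the bound that pt is tested against.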
> I'm not sure if that was the intention, hope it makes some sense!
> Stepping through that code with the input string " \ud83d\udc4d":
> * the input string becomes " \ud83d\udc4d\u008d" (appended 'Reverse Line Feed' - no idea why)
> * consumed == 4
> * Iterating the loop gives pt=0, pos=0 -> pt=1, pos=1 -> pt=2, pos=3 -> pt=3, pos=4 (Index exception)
> So \ud83d\udc4d seems to be a codepoint with a width of 2, which puts the index off by one after that.
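The off-by-one described above can be reproduced without Commons Lang. This is a minimal sketch, assuming a loop that advances one code point per iteration while its bound is taken from a char count; SurrogateWalk and walk are hypothetical names. It also shows that the pair in the input decodes to a single supplementary code point, U+1F44D, which occupies two chars.

```java
// Hypothetical sketch; not the Commons Lang source.
final class SurrogateWalk {

    static final String INPUT = " \ud83d\udc4d"; // space + high surrogate + low surrogate

    /** Advances one code point per step (pt++) and returns the final char index. */
    static int walk(CharSequence input, int steps) {
        int pos = 0;
        for (int pt = 0; pt < steps; pt++) {
            pos += Character.charCount(Character.codePointAt(input, pos));
        }
        return pos;
    }

    public static void main(String[] args) {
        System.out.println(INPUT.length());                            // prints 3
        System.out.println(INPUT.codePointCount(0, INPUT.length()));   // prints 2
        System.out.println(Integer.toHexString(INPUT.codePointAt(1))); // prints 1f44d
        System.out.println(walk(INPUT, 2));                            // prints 3
        try {
            walk(INPUT, INPUT.length()); // step count taken from the char count
        } catch (IndexOutOfBoundsException e) {
            System.out.println("index out of range"); // third step reads past the end
        }
    }
}
```

With two steps (the code point count) the walk ends exactly at the end of the string; with three steps (the char count) the third call to codePointAt reads index 3 of a 3-char string and fails.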
> Anyway, hope that helps,
> Regards,
> Mike.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
