From issues-return-31168-apmail-commons-issues-archive=commons.apache.org@commons.apache.org Mon Dec 10 12:07:24 2012
Return-Path:
X-Original-To: apmail-commons-issues-archive@minotaur.apache.org
Delivered-To: apmail-commons-issues-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 24987E1F4 for ; Mon, 10 Dec 2012 12:07:24 +0000 (UTC)
Received: (qmail 84629 invoked by uid 500); 10 Dec 2012 12:07:23 -0000
Delivered-To: apmail-commons-issues-archive@commons.apache.org
Received: (qmail 84355 invoked by uid 500); 10 Dec 2012 12:07:22 -0000
Mailing-List: contact issues-help@commons.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: issues@commons.apache.org
Delivered-To: mailing list issues@commons.apache.org
Received: (qmail 84322 invoked by uid 99); 10 Dec 2012 12:07:21 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 10 Dec 2012 12:07:21 +0000
Date: Mon, 10 Dec 2012 12:07:20 +0000 (UTC)
From: "Michael Houston (JIRA)"
To: issues@commons.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Updated] (LANG-862) CharSequenceTranslator causes StringIndexOutOfBoundsException during translation of unicode codepoints with length > 1 character
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

     [ https://issues.apache.org/jira/browse/LANG-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Houston updated LANG-862:
---------------------------------

    Description:
When translating a string with unicode characters in, I've encountered an index exception:

{code}
    java.lang.StringIndexOutOfBoundsException: String index out of range: 50
    at java.lang.String.charAt(String.java:686)
    at java.lang.Character.codePointAt(Character.java:2335)
    at org.apache.commons.lang3.text.translate.CharSequenceTranslator.translate(CharSequenceTranslator.java:95)
    at org.apache.commons.lang3.text.translate.CharSequenceTranslator.translate(CharSequenceTranslator.java:59)
    at org.apache.commons.lang3.StringEscapeUtils.escapeCsv(StringEscapeUtils.java:556)
    ...
{code}

The input string was from a twitter status:

org.apache.commons.lang3.StringEscapeUtils.escapeCsv("pink & black adidas suit for this rainy weather \ud83d\udc4d\u008d");

Both those characters are 'Invalid' unicode characters, so presumably there is a conversion error somewhere. However, this shouldn't cause the translator to crash.

At line 94, the loop which generates the exception increments the position by the size of the codepoint, which seems to grow faster than the number of characters. I don't really know how codepoints work, but it looks to me like there are two indexes which are treated as if they are the same one by this loop:

* pt is incrementing by one character each iteration
* pos is incrementing by one or more characters each iteration
* pos is being used to index into the character array
* pt is the value actually being tested in the loop test, so pos can be bigger than pt, causing an index problem at the end of the array

My guess would be that the loop should read something like:

{code}
for (int pt = 0; pt < consumed;) {
    int count = Character.charCount(Character.codePointAt(input, pos));
    pt += count;
    pos += count;
}
{code}

I'm not sure if that was the intention, hope it makes some sense!
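
For readers who want to see the drift between the two indexes, here is a minimal sketch. It is not the Commons Lang source: the class `IndexDriftDemo` and its methods are hypothetical, the "reported" loop shape is reconstructed only from the description above (pt advancing by one, pos advancing by the char width of the code point), and `consumed` is assumed to be counted in chars.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the loop shapes are assumptions taken from the report text.
class IndexDriftDemo {

    // Loop as described in the report: pt advances by one per iteration,
    // pos advances by the number of chars in the code point at pos.
    static List<String> traceReported(CharSequence input, int consumed) {
        List<String> steps = new ArrayList<String>();
        int pos = 0;
        for (int pt = 0; pt < consumed; pt++) {
            try {
                int width = Character.charCount(Character.codePointAt(input, pos));
                steps.add("pt=" + pt + ", pos=" + pos);
                pos += width;
            } catch (IndexOutOfBoundsException e) {
                // pos has run past the last char while pt is still below consumed.
                steps.add("pt=" + pt + ", pos=" + pos + " (index exception)");
                break;
            }
        }
        return steps;
    }

    // Loop as proposed in the report: pt and pos advance by the same amount.
    static int finalPosProposed(CharSequence input, int consumed) {
        int pos = 0;
        for (int pt = 0; pt < consumed;) {
            int count = Character.charCount(Character.codePointAt(input, pos));
            pt += count;
            pos += count;
        }
        return pos;
    }

    public static void main(String[] args) {
        // Space, one surrogate pair, one C1 control char: four chars in total.
        String input = " \ud83d\udc4d\u008d";
        System.out.println(traceReported(input, input.length()));
        System.out.println(finalPosProposed(input, input.length()));
    }
}
```

With the four-char sample, the first loop fails on its fourth iteration because pos has already reached 4, while the proposed loop stops with pos equal to the input length. Whether `consumed` is meant to count chars or code points in the real translator is not settled by this sketch.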

Stepping through that code with the input string " \ud83d\udc4d\u008d":

* the input string becomes " \ud83d\udc4d\u008d\u008d" (appended 'Reverse Line Feed' - no idea why)
* consumed == 4
* Iterating the loop gives pt=0, pos=0 -> pt=1, pos=1 -> pt=2, pos=3 -> pt=3, pos=4 (Index exception)

So \ud83d\udc4d seems to be a codepoint with a width of 2, which puts the index off by one after that.

Anyway, hope that helps,

Regards,
Mike.

  was:
When translating a string with unicode characters in, I've encountered an index exception:

    java.lang.StringIndexOutOfBoundsException: String index out of range: 50
    at java.lang.String.charAt(String.java:686)
    at java.lang.Character.codePointAt(Character.java:2335)
    at org.apache.commons.lang3.text.translate.CharSequenceTranslator.translate(CharSequenceTranslator.java:95)
    at org.apache.commons.lang3.text.translate.CharSequenceTranslator.translate(CharSequenceTranslator.java:59)
    at org.apache.commons.lang3.StringEscapeUtils.escapeCsv(StringEscapeUtils.java:556)
    ...

The input string was from a twitter status:

org.apache.commons.lang3.StringEscapeUtils.escapeCsv("pink & black adidas suit for this rainy weather \ud83d\udc4d\u008d");

Both those characters are 'Invalid' unicode characters, so presumably there is a conversion error somewhere. However, this shouldn't cause the translator to crash.

At line 94, the loop which generates the exception increments the position by the size of the codepoint, which seems to grow faster than the number of characters. I don't really know how codepoints work, but it looks to me like there are two indexes which are treated as if they are the same one by this loop:

pt is incrementing by one character each iteration
pos is incrementing by one or more characters each iteration
pos is being used to index into the character array
pt is the value actually being tested in the loop test, so pos can be bigger than pt, causing an index problem at the end of the array

My guess would be that the loop should read something like:

for (int pt = 0; pt < consumed;) {
    int count = Character.charCount(Character.codePointAt(input, pos));
    pt += count;
    pos += count;
}

I'm not sure if that was the intention, hope it makes some sense!

Stepping through that code with the input string " \ud83d\udc4d\u008d":

* the input string becomes " \ud83d\udc4d\u008d\u008d" (appended 'Reverse Line Feed' - no idea why)
* consumed == 4
* Iterating the loop gives pt=0, pos=0 -> pt=1, pos=1 -> pt=2, pos=3 -> pt=3, pos=4 (Index exception)

So \ud83d\udc4d seems to be a codepoint with a width of 2, which puts the index off by one after that.

Anyway, hope that helps,

Regards,
Mike.
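
The "width of 2" observation can be checked with the standard `String` and `Character` methods. The sketch below is only an illustration: the class `SurrogateWidthDemo` and its helper are hypothetical names, and the sample string is the short one used in the walkthrough (space, surrogate pair, U+008D).

```java
// Hypothetical demo class showing char counts versus code point counts.
class SurrogateWidthDemo {

    // Number of chars (UTF-16 code units) used by the code point at index 0.
    static int widthOfFirstCodePoint(String s) {
        return Character.charCount(s.codePointAt(0));
    }

    public static void main(String[] args) {
        String thumbsUp = "\ud83d\udc4d";
        // The surrogate pair decodes to a single supplementary code point.
        System.out.println(Integer.toHexString(thumbsUp.codePointAt(0))); // 1f44d
        System.out.println(widthOfFirstCodePoint(thumbsUp));              // 2

        String sample = " " + thumbsUp + "\u008d";
        System.out.println(sample.length());                              // 4
        System.out.println(sample.codePointCount(0, sample.length()));    // 3
    }
}
```

The sample therefore has four chars but only three code points, which is the gap that lets a per-char counter and a per-code-point position fall out of step.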

> CharSequenceTranslator causes StringIndexOutOfBoundsException during translation of unicode codepoints with length > 1 character
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LANG-862
>                 URL: https://issues.apache.org/jira/browse/LANG-862
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.text.translate.*
>    Affects Versions: 3.1
>         Environment: OS X, Java 1.6
>            Reporter: Michael Houston
>              Labels: bug, text, unicode
I don't really know how codepoints work, but it looks to me = like there are two indexes which are treated as if they are the same one by= this loop: > * pt is incrementing by one character each iteration > * pos is incrementing by one or more characters each iteration > * pos is being used to index into the character array > * pt is the value actually being tested in the loop test, so pos can be = bigger than pt, causing an index problem at the end of the array > My guess would be that the loop should read something like: > {code} > for (int pt =3D 0; pt < consumed;) { > int count =3D Character.charCount(Character.codePointAt(i= nput, pos)); > pt +=3D count; > pos +=3D count; > } > {code} > I'm not sure if that was the intention, hope it makes some sense! > Stepping through that code with the input string " \ud83d\udc4d=C2=8D": > * the input string becomes " \ud83d\udc4d=C2=8D\u008d" (appended 'Reverse= Line Feed' - no idea why) > * consumed =3D=3D 4 > * Iterating the loop gives pt=3D0, pos=3D0 -> pt=3D1, pos=3D1 -> pt=3D2, = pos=3D3 -> pt-3, pos=3D4 (Index exception) > So \ud83d\udc4d seems to be a codepoint with a width of 2, which puts the= index off by one after that. > Anyway, hope that helps, > Regards, > Mike. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrato= rs For more information on JIRA, see: http://www.atlassian.com/software/jira