Date: Mon, 10 Dec 2012 12:13:21 +0000 (UTC)
From: "Michael Houston (JIRA)"
To: issues@commons.apache.org
Reply-To: issues@commons.apache.org
Subject: [jira] [Commented] (LANG-862) CharSequenceTranslator causes StringIndexOutOfBoundsException during translation of unicode codepoints with length > 1 character

    [ https://issues.apache.org/jira/browse/LANG-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527896#comment-13527896 ]

Michael Houston commented on LANG-862:
--------------------------------------

Apologies, I see this is fixed in the latest SVN - should have browsed the source code first!

> CharSequenceTranslator causes StringIndexOutOfBoundsException during translation of unicode codepoints with length > 1 character
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LANG-862
>                 URL: https://issues.apache.org/jira/browse/LANG-862
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.text.translate.*
>    Affects Versions: 3.1
>         Environment: OS X, Java 1.6
>            Reporter: Michael Houston
>              Labels: bug, text, unicode
>
> When translating a string with unicode characters in, I've encountered an index exception:
> {code}
> 	java.lang.StringIndexOutOfBoundsException: String index out of range: 50
> 	at java.lang.String.charAt(String.java:686)
> 	at java.lang.Character.codePointAt(Character.java:2335)
> 	at org.apache.commons.lang3.text.translate.CharSequenceTranslator.translate(CharSequenceTranslator.java:95)
> 	at org.apache.commons.lang3.text.translate.CharSequenceTranslator.translate(CharSequenceTranslator.java:59)
> 	at org.apache.commons.lang3.StringEscapeUtils.escapeCsv(StringEscapeUtils.java:556)
> 	...
> {code}
> The input string was from a twitter status:
> org.apache.commons.lang3.StringEscapeUtils.escapeCsv("pink & black adidas suit for this rainy weather \ud83d\udc4d");
> Both those characters are 'Invalid' unicode characters, so presumably there is a conversion error somewhere. However, this shouldn't cause the translator to crash.
> At line 94, the loop which generates the exception increments the position by the size of the codepoint, which seems to grow faster than the number of characters.
> I don't really know how codepoints work, but it looks to me like there are two indexes which are treated as if they are the same one by this loop:
> * pt is incrementing by one character each iteration
> * pos is incrementing by one or more characters each iteration
> * pos is being used to index into the character array
> * pt is the value actually being tested in the loop test, so pos can be bigger than pt, causing an index problem at the end of the array
> My guess would be that the loop should read something like:
> {code}
>         for (int pt = 0; pt < consumed;) {
>             int count = Character.charCount(Character.codePointAt(input, pos));
>             pt += count;
>             pos += count;
>         }
> {code}
> I'm not sure if that was the intention, hope it makes some sense!
> Stepping through that code with the input string " \ud83d\udc4d":
> * the input string becomes " \ud83d\udc4d\u008d" (appended 'Reverse Line Feed' - no idea why)
> * consumed == 4
> * Iterating the loop gives pt=0, pos=0 -> pt=1, pos=1 -> pt=2, pos=3 -> pt=3, pos=4 (Index exception)
> So \ud83d\udc4d seems to be a codepoint with a width of 2, which puts the index off by one after that.
> Anyway, hope that helps,
> Regards,
> Mike.
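
The index drift in the walkthrough above can be reproduced in isolation. What follows is a minimal sketch, not the Commons Lang source: the class and method names (CodePointAdvance, advanceAsReported, advanceAsSuggested) are hypothetical, and it assumes that consumed is a count of chars, as in the trace (consumed == 4 for a four-char input made of a space, one surrogate pair and U+008D).

```java
// Hypothetical names throughout; an illustration only, not Commons Lang code.
final class CodePointAdvance {

    // Loop shape described in the report: pt counts iterations, while pos
    // advances by the char width (1 or 2) of each code point it passes.
    static int advanceAsReported(CharSequence input, int pos, int consumed) {
        for (int pt = 0; pt < consumed; pt++) {
            pos += Character.charCount(Character.codePointAt(input, pos));
        }
        return pos;
    }

    // Loop shape suggested in the report: pt and pos advance by the same
    // number of chars, so pos cannot run ahead of pt.
    static int advanceAsSuggested(CharSequence input, int pos, int consumed) {
        for (int pt = 0; pt < consumed;) {
            int count = Character.charCount(Character.codePointAt(input, pos));
            pt += count;
            pos += count;
        }
        return pos;
    }

    public static void main(String[] args) {
        // Assumption: space + surrogate pair (U+1F44D) + U+008D, i.e. 4 chars.
        String input = " \ud83d\udc4d\u008d";
        int consumed = input.length(); // 4 chars, but only 3 code points

        System.out.println(advanceAsSuggested(input, 0, consumed)); // 4

        try {
            advanceAsReported(input, 0, consumed);
        } catch (IndexOutOfBoundsException e) {
            // The fourth iteration reads index 4 of a 4-char string.
            System.out.println("overran input: " + e.getMessage());
        }
    }
}
```

If consumed were instead a count of code points, the first loop shape would advance correctly (three iterations for this input); which unit the translators actually return is not something this sketch settles.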