Date: Mon, 15 Aug 2011 20:17:27 +0000 (UTC)
From: "Sebb (JIRA)"
To: issues@commons.apache.org
Message-ID: <1238667846.39419.1313439447384.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <408671176.35844.1313235927977.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (CODEC-127) Non-ascii characters in source files

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085301#comment-13085301 ]

Sebb commented on CODEC-127:
----------------------------

I think all the files are now fixed so that the code uses Unicode escapes; the only non-ASCII characters are now in comments.

> Non-ascii characters in source files
> ------------------------------------
>
>                 Key: CODEC-127
>                 URL: https://issues.apache.org/jira/browse/CODEC-127
>             Project: Commons Codec
>          Issue Type: Bug
>            Reporter: Sebb
>
> Some of the test cases include characters in a native encoding (possibly UTF-8), rather than using Unicode escapes.
> This can cause a problem for IDEs if they don't know the encoding (e.g. cause compilation errors, which is how I found the issue), and some transformations, e.g. fixing EOL, may corrupt the contents.
> I think we should have a rule of using Unicode escapes for all such non-ASCII characters.
> It's particularly important for non-ISO-8859-1 characters.
> Some example classes with non-ASCII characters:
> {code}
> binary\Base64Test.java:96         byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
> language\ColognePhoneticTest.java:110             {"m├Ânchengladbach", "664645214"},
> language\ColognePhoneticTest.java:130         String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
> language\ColognePhoneticTest.java:137             {"Meyer", "M├╝ller"},
> language\ColognePhoneticTest.java:143             {"ganz", "G├ñnse"},
> language\DoubleMetaphoneTest.java:1222         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
> language\DoubleMetaphoneTest.java:1227         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
> language\SoundexTest.java:367         if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:369             Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:375             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:387         if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:389             Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:395             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
> {code}
> The characters are probably not correct above, because I used a crude perl script to find them:
> {code}
> perl -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" xxxx.java
> {code}
> language\SoundexTest.java:367 in particular is incorrect, because it's supposed to be a single character.
> Now one might think that native2ascii -encoding UTF-8 would fix that, but it gives:
>         if (Character.isLetter('\ufffd'))
> which is an "unknown" character.
> Similarly for binary\Base64Test.java:96.
> It's not all that clear what the Unicode escapes should be in these cases, but probably not the unknown character.
> [Possibly the characters got mangled at some point, or maybe they have always been wrong]
> The ColognePhoneticTest.java cases are less serious, as the characters are valid ISO-8859-1 (accented German), but given that the rest of the file uses Unicode escapes, I think they should be changed too (but add comments to say what they are, e.g. o-umlaut, u-umlaut).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
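
For illustration only, the rule proposed in the issue (Unicode escapes for all non-ASCII characters in source) could be sketched as below. The class name UnicodeEscaper and the method escapeNonAscii are hypothetical; this is not the change that was actually applied to Commons Codec. The sketch only escapes the characters it is given, so it cannot recover characters that were already mangled, such as the "unknown" character case mentioned above.

```java
public class UnicodeEscaper {

    /**
     * Hypothetical helper: returns the text with every char above 0x7F
     * replaced by a Java-style Unicode escape (backslash, 'u', four hex digits).
     * ASCII chars pass through unchanged. Supplementary characters are
     * emitted as two escapes, one per surrogate, which is valid in Java source.
     */
    public static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                // "\\u" in a string literal is a literal backslash followed by 'u'
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The input is written with an escape so this file itself stays pure ASCII.
        System.out.println(escapeNonAscii("M\u00fcller")); // prints M\u00fcller
    }
}
```

Following the suggestion in the issue, an escaped literal would normally carry a comment naming the character (e.g. u-umlaut), since the escape itself is not readable.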