Mailing-List: contact issues-help@commons.apache.org; run by ezmlm
Precedence: bulk
Reply-To: issues@commons.apache.org
Date: Mon, 15 Aug 2011 20:17:27 +0000 (UTC)
From: "Sebb (JIRA)" <jira@apache.org>
To: issues@commons.apache.org
Message-ID: 
 <1238667846.39419.1313439447384.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <408671176.35844.1313235927977.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (CODEC-127) Non-ascii characters in source files
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085301#comment-13085301 ]

Sebb commented on CODEC-127:
----------------------------

I think all the files are now fixed so that the code uses Unicode escapes; the only non-ASCII characters are now in comments.
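
As a rough illustration of that convention (not the actual committed code), a test string can keep the source file pure ASCII by using Unicode escapes, with a plain-ASCII comment naming each character. The class, constant and method names below are hypothetical, and the spellings assume the mangled German strings in the issue stand for o-umlaut, u-umlaut and a-umlaut:

```java
// Hypothetical demo class, not taken from the Commons Codec sources.
final class UnicodeEscapeDemo {

    // o-umlaut (U+00F6)
    static final String MOENCHENGLADBACH = "m\u00f6nchengladbach";

    // u-umlaut (U+00FC), used twice
    static final String MUELLER_LUEDENSCHEIDT = "M\u00fcller-L\u00fcdenscheidt";

    // a-umlaut (U+00E4)
    static final String GAENSE = "G\u00e4nse";

    // True when every char of s lies in the 7-bit ASCII range.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The source text is ASCII-only, but the compiled strings
        // hold the real accented characters.
        System.out.println(isAscii(GAENSE));
        System.out.println(Integer.toHexString(GAENSE.charAt(1)));
    }
}
```

The compiler translates the escapes before anything else, so the result no longer depends on the encoding the IDE or build assumes for the file.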

> Non-ascii characters in source files
> ------------------------------------
>
>                 Key: CODEC-127
>                 URL: https://issues.apache.org/jira/browse/CODEC-127
>             Project: Commons Codec
>          Issue Type: Bug
>            Reporter: Sebb
>
> Some of the test cases include characters in a native encoding (possibly UTF-8), rather than using Unicode escapes.
> This can cause problems for IDEs that don't know the encoding (e.g. compilation errors, which is how I found the issue), and some transformations, e.g. fixing EOL, may corrupt the contents.
> I think we should have a rule of using Unicode escapes for all such non-ascii characters.
> It's particularly important for non-ISO-8859-1 characters.
> Some example classes with non-ascii characters:
> {code}
> binary\Base64Test.java:96         byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
> language\ColognePhoneticTest.java:110             {"m├Ânchengladbach", "664645214"},
> language\ColognePhoneticTest.java:130         String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
> language\ColognePhoneticTest.java:137             {"Meyer", "M├╝ller"},
> language\ColognePhoneticTest.java:143             {"ganz", "G├ñnse"},
> language\DoubleMetaphoneTest.java:1222         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
> language\DoubleMetaphoneTest.java:1227         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
> language\SoundexTest.java:367         if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:369                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:375             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:387         if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:389                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:395             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
> {code}
> The characters are probably not correct above, because I used a crude perl script to find them:
> {code}
> perl -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" xxxx.java
> {code}
> language\SoundexTest.java:367 in particular is incorrect, because it's supposed to be a single character.
> Now one might think that native2ascii -encoding UTF-8 would fix that, but it gives:
> if (Character.isLetter('\ufffd'))
> which is the Unicode replacement ("unknown") character.
> Similarly for binary\Base64Test.java:96.
> It's not all that clear what the Unicode escapes should be in these cases, but probably not the unknown character.
> [Possibly the characters got mangled at some point, or maybe they have always been wrong]
> The ColognePhoneticTest.java cases are less serious, as the characters are valid ISO-8859-1 (accented German), but given that the rest of the file uses Unicode escapes, I think they should be changed too (but add comments to say what they are, e.g. o-umlaut, u-umlaut)

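On the \ufffd result mentioned in the issue: one way such a character can arise (an assumption here, the issue does not say what actually happened to these files) is that bytes written in a single-byte encoding were at some point decoded as UTF-8. Java replaces malformed UTF-8 input with U+FFFD, and saving that text as UTF-8 again stores the bytes EF BF BD. A small sketch with a hypothetical class name:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Commons Codec.
final class ReplacementCharDemo {

    // Decodes bytes as UTF-8; new String(byte[], Charset) replaces
    // malformed input with the replacement character U+FFFD.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "G", a-umlaut as the single ISO-8859-1 byte 0xE4, then "nse".
        byte[] latin1 = { 'G', (byte) 0xE4, 'n', 's', 'e' };
        String decoded = decodeUtf8(latin1);
        // The a-umlaut is lost; only the replacement character remains.
        System.out.println(decoded.charAt(1) == '\uFFFD');

        // Writing the replacement character back out as UTF-8.
        for (byte b : "\uFFFD".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Once that has happened the original character cannot be recovered from the file itself, which would explain why it is unclear what the escapes should be.
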
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira