commons-issues mailing list archives

From "Sebb (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CODEC-127) Non-ascii characters in test source files
Date Sat, 13 Aug 2011 13:21:27 GMT

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084611#comment-13084611
] 

Sebb commented on CODEC-127:
----------------------------

The problem is that it's not possible to see what the test data is in the IDE (apart from
the German chars).

Also, unless you tell SVN the encoding (e.g. via mime-type), diff e-mails (and possibly conversion
to local EOL) may suffer.

Saving IDE settings in SVN is a non-starter, because there are many different IDEs, and in
any case it's not possible to have the settings picked up automatically, as far as I know.

Have another look at the non-ISO-8859-1 characters and see if they are correct. I suspect
not, as they all appear to be the replacement character (\ufffd), at least when treated as
UTF-8.
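
As a rough illustration of how \ufffd can end up in a file: a minimal sketch, assuming the original bytes were ISO-8859-1 and were later decoded as UTF-8 (only one possible cause; the actual history of these files is not known). The class and method names are hypothetical and not part of Commons Codec.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Commons Codec.
final class ReplacementCharDemo {

    /** Decodes the bytes as UTF-8; the JDK substitutes U+FFFD for malformed input. */
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // o-umlaut, written as a Unicode escape so this file stays pure ASCII.
        String oUmlaut = "\u00F6";

        // ISO-8859-1 encodes it as the single byte 0xF6, which is not valid UTF-8.
        byte[] latin1 = oUmlaut.getBytes(StandardCharsets.ISO_8859_1);
        // UTF-8 encodes it as the two bytes 0xC3 0xB6.
        byte[] utf8 = oUmlaut.getBytes(StandardCharsets.UTF_8);

        // Assumed scenario: a tool reads ISO-8859-1 bytes as if they were UTF-8.
        String misread = decodeAsUtf8(latin1);
        String roundTrip = decodeAsUtf8(utf8);

        System.out.println("misread is U+FFFD: " + misread.equals("\uFFFD"));
        System.out.println("round trip intact: " + roundTrip.equals(oUmlaut));
    }
}
```

Once a file has been saved after such a misread, the original byte is gone, which would explain why it is no longer obvious what the test data was meant to be.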

> Non-ascii characters in test source files
> -----------------------------------------
>
>                 Key: CODEC-127
>                 URL: https://issues.apache.org/jira/browse/CODEC-127
>             Project: Commons Codec
>          Issue Type: Bug
>            Reporter: Sebb
>
> Some of the test cases include characters in a native encoding (possibly UTF-8), rather
than using Unicode escapes.
> This can cause a problem for IDEs if they don't know the encoding (e.g. cause compilation
errors, which is how I found the issue), and possibly some transformations may corrupt the
contents, e.g. fixing EOL.
> I think we should have a rule of using Unicode escapes for all such non-ascii characters.
> It's particularly important for non-ISO-8859-1 characters.
> Some example classes with non-ascii characters:
> {code}
> binary\Base64Test.java:96         byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
> language\ColognePhoneticTest.java:110             {"m├Ânchengladbach", "664645214"},
> language\ColognePhoneticTest.java:130         String[][] data = {{"bergisch-gladbach",
"174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
> language\ColognePhoneticTest.java:137             {"Meyer", "M├╝ller"},
> language\ColognePhoneticTest.java:143             {"ganz", "Gänse"},
> language\DoubleMetaphoneTest.java:1222         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢",
"S");
> language\DoubleMetaphoneTest.java:1227         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢",
"N");
> language\SoundexTest.java:367         if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:369                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:375             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:387         if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:389                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:395             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
> {code}
> The characters are probably not correct above, because I used a crude perl script to
find them:
> {code}
> perl -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" */*.java
> {code}
> language\SoundexTest.java:367 in particular is incorrect, because it's supposed to be
a single character.
> Now one might think that native2ascii -encoding UTF-8 would fix that, but it gives:
> if (Character.isLetter('\ufffd'))
> which is an "unknown" character.
> Similarly for binary\Base64Test.java:96.
> It's not all that clear what the Unicode escapes should be in these cases, but probably
not the unknown character.
> [Possibly the characters got mangled at some point, or maybe they have always been wrong]
> The ColognePhoneticTest.java cases are less serious, as the characters are valid ISO-8859-1
(accented German), but given that the rest of the file uses Unicode escapes, I think they should
be changed too (but add comments to say what they are, e.g. o-umlaut, u-umlaut)
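
To make the proposed rule concrete, here is a minimal sketch of rewriting non-ASCII characters as Unicode escapes, roughly what native2ascii does for correctly decoded input. The class UnicodeEscaper and its method are hypothetical helpers, not part of Commons Codec or the JDK, and the sketch assumes the text has already been decoded with the right encoding.

```java
// Hypothetical helper; not part of Commons Codec or the JDK.
final class UnicodeEscaper {

    /** Returns the input with every non-ASCII UTF-16 code unit written as a backslash-u escape. */
    static String toUnicodeEscapes(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 0x80) {
                // Plain ASCII is copied through unchanged.
                out.append(c);
            } else {
                // Everything else becomes a four-digit hexadecimal escape.
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The inputs are themselves written with escapes so this file stays ASCII.
        System.out.println(toUnicodeEscapes("m\u00F6nchengladbach")); // o-umlaut
        System.out.println(toUnicodeEscapes("M\u00FCller"));          // u-umlaut
    }
}
```

In a test file the escaped form would then be used directly in the string literal, with a trailing comment naming the character (o-umlaut, u-umlaut). The sketch does not treat backslashes already present in the input specially, so it is not a full replacement for native2ascii.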

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
