commons-issues mailing list archives

From "Sebb (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (CODEC-127) Non-ascii characters in test source files
Date Sun, 14 Aug 2011 00:05:27 GMT

    [ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084743#comment-13084743 ]

Sebb edited comment on CODEC-127 at 8/14/11 12:04 AM:
------------------------------------------------------

Here's the full list of lines containing non-ASCII characters:

{code}
java/org/apache/commons/codec/language/ColognePhonetic.java:264    private static final char[][] PREPROCESS_MAP = new char[][]{{'\u00C4', 'A'}, // ├âÔÇ×
java/org/apache/commons/codec/language/ColognePhonetic.java:265        {'\u00DC', 'U'}, // Ü
java/org/apache/commons/codec/language/ColognePhonetic.java:266        {'\u00D6', 'O'}, // ├âÔÇô
java/org/apache/commons/codec/language/ColognePhonetic.java:267        {'\u00DF', 'S'} // ├â┼©
java/org/apache/commons/codec/language/ColognePhonetic.java:388     * Converts the string to upper case and replaces germanic umlauts, and the ├óÔé¼┼ô├â┼©├óÔé¼´┐¢.
test/org/apache/commons/codec/binary/Base64Test.java:96        byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
test/org/apache/commons/codec/language/ColognePhoneticTest.java:110            {"m├Ânchengladbach", "664645214"},
test/org/apache/commons/codec/language/ColognePhoneticTest.java:130        String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
test/org/apache/commons/codec/language/ColognePhoneticTest.java:137            {"Meyer", "M├╝ller"},
test/org/apache/commons/codec/language/ColognePhoneticTest.java:143            {"ganz", "Gänse"},
test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1222        this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1227        this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
test/org/apache/commons/codec/language/SoundexTest.java:367        if (Character.isLetter('´┐¢')) {
test/org/apache/commons/codec/language/SoundexTest.java:369                Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/SoundexTest.java:375            Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/SoundexTest.java:387        if (Character.isLetter('´┐¢')) {
test/org/apache/commons/codec/language/SoundexTest.java:389                Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/SoundexTest.java:395            Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/bm/BeiderMorseEncoderTest.java:93        String[] names = { "ácz", "átz", "Ignácz", "Ignátz", "Ignác" };
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:47                { "Nu├▒ez", "spanish", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:49                { "─îapek", "czech", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:52                { "Küçük", "turkish", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:55                { "Ceauşescu", "romanian", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:57                { "╬æ╬│╬│╬Á╬╗¤î¤Ç╬┐¤à╬╗╬┐¤é", "greek", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:58                { "ðƒÐâÐêð║ð©ð¢", "cyrillic", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:59                { "ÎøÎö΃", "hebrew", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:60                { "ácz", "any", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:61                { "átz", "any", EXACT } });
{code}

Note the comment at ColognePhonetic.java:388 - this does not seem to make sense in any encoding,
but I could be wrong.
[You'll need to look at it in the source file itself - the Perl script I used is crude and
does not display non-ASCII properly]

The other dubious entries are:

Base64Test.java:96
DoubleMetaphoneTest.java:1222
DoubleMetaphoneTest.java:1227
and most of the SoundexTest.java entries.
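
Since the crude Perl output above cannot show the characters reliably, one way to get an unambiguous listing is to print each non-ASCII character as a Unicode escape instead of echoing it. The following is only a sketch, not the script used for this report: the class name NonAsciiScanner and its methods are made up for illustration, it needs Java 7 or later, and it deliberately reads every file as ISO-8859-1 so that each byte above 0x7F shows up as one \u00XX escape regardless of the file's real encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Commons Codec.
class NonAsciiScanner {

    /** Returns the line with every char above 0x7F written as a 4-digit hex escape. */
    static String escapeNonAscii(String line) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c > 0x7F) {
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Returns "file:lineNumber escapedLine" for each line containing non-ASCII chars. */
    static List<String> scan(String fileName, List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            String escaped = escapeNonAscii(line);
            if (!escaped.equals(line)) {
                hits.add(fileName + ":" + (i + 1) + " " + escaped);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            // ISO-8859-1 maps each byte to one char, so decoding cannot fail
            // and the escapes printed are effectively the raw byte values.
            List<String> lines = Files.readAllLines(Paths.get(arg), StandardCharsets.ISO_8859_1);
            for (String hit : scan(arg, lines)) {
                System.out.println(hit);
            }
        }
    }
}
```

Run it with the source files as arguments, e.g. java NonAsciiScanner test/org/apache/commons/codec/language/SoundexTest.java. With this approach a U+FFFD stored as UTF-8 would be reported as \u00EF\u00BF\u00BD, which makes the underlying bytes visible even when the console cannot display them.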

> Non-ascii characters in test source files
> -----------------------------------------
>
>                 Key: CODEC-127
>                 URL: https://issues.apache.org/jira/browse/CODEC-127
>             Project: Commons Codec
>          Issue Type: Bug
>            Reporter: Sebb
>
> Some of the test cases include characters in a native encoding (possibly UTF-8), rather than using Unicode escapes.
> This can cause a problem for IDEs if they don't know the encoding (e.g. cause compilation errors, which is how I found the issue), and possibly some transformations may corrupt the contents, e.g. fixing EOL.
> I think we should have a rule of using Unicode escapes for all such non-ascii characters.
> It's particularly important for non-ISO-8859-1 characters.
> Some example classes with non-ascii characters:
> {code}
> binary\Base64Test.java:96         byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
> language\ColognePhoneticTest.java:110             {"m├Ânchengladbach", "664645214"},
> language\ColognePhoneticTest.java:130         String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
> language\ColognePhoneticTest.java:137             {"Meyer", "M├╝ller"},
> language\ColognePhoneticTest.java:143             {"ganz", "Gänse"},
> language\DoubleMetaphoneTest.java:1222         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
> language\DoubleMetaphoneTest.java:1227         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
> language\SoundexTest.java:367         if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:369                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:375             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:387         if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:389                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:395             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
> {code}
> The characters are probably not correct above, because I used a crude perl script to find them:
> {code}
> perl -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" */*.java
> {code}
> language\SoundexTest.java:367 in particular is incorrect, because it's supposed to be a single character.
> Now one might think that native2ascii -encoding UTF-8 would fix that, but it gives:
> if (Character.isLetter('\ufffd'))
> which is an "unknown" character.
> Similarly for binary\Base64Test.java:96.
> It's not all that clear what the Unicode escapes should be in these cases, but probably not the unknown character.
> [Possibly the characters got mangled at some point, or maybe they have always been wrong]
> The ColognePhoneticTest.java cases are less serious, as the characters are valid ISO-8859-1 (accented German), but given that the rest of the file uses Unicode escapes, I think they should be changed too (but add comments to say what they are, e.g. o-umlaut, u-umlaut)
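
The \ufffd result from native2ascii reported above is consistent with at least two different situations, and the sketch below tries to tell them apart. It is an illustration under assumptions, not an analysis of the actual files: the class ReplacementCharDemo and its helper names are hypothetical, the byte values are hand-picked examples, and it needs Java 7 or later. It also shows the proposed style of an escaped literal with a comment naming the character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Commons Codec.
class ReplacementCharDemo {

    // An escaped literal does not depend on the encoding of the source file.
    static final String GAENSE = "G\u00E4nse"; // a-umlaut

    static String decodeUtf8(byte[] bytes) {
        // Malformed input is replaced with U+FFFD rather than rejected.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String decodeLatin1(byte[] bytes) {
        // Every byte maps to exactly one char, so this never fails.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Case 1: an ISO-8859-1 a-umlaut (byte 0xE4) is not valid UTF-8,
        // so a UTF-8 decoder substitutes U+FFFD for it.
        byte[] latin1 = { 'G', (byte) 0xE4, 'n', 's', 'e' };
        System.out.println(decodeUtf8(latin1).indexOf('\uFFFD') >= 0);
        System.out.println(decodeLatin1(latin1).equals(GAENSE));

        // Case 2: the file already contains U+FFFD stored as UTF-8 (EF BF BD).
        byte[] stored = { (byte) 0xEF, (byte) 0xBF, (byte) 0xBD };
        System.out.println(decodeUtf8(stored).length());   // one char
        System.out.println(decodeLatin1(stored).length()); // three chars
    }
}
```

In case 1 the original character is still recoverable by decoding with the right charset. In case 2 it is already lost, which would explain why native2ascii can only report \ufffd, and a tool that reads those three bytes with a single-byte encoding sees three characters inside a char literal, which does not compile. Whether either case is what actually happened to SoundexTest.java or Base64Test.java is a guess.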

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
