tomcat-users mailing list archives

From Christopher Schultz <ch...@christopherschultz.net>
Subject Re: Create FileInputStream in servlet from remote file with accentuated character name
Date Tue, 22 Sep 2009 15:33:37 GMT

Sylvie,

On 9/22/2009 11:01 AM, Sylvie Perrin wrote:
> The cause was the LC_ALL variable in my script starting tomcat.
> I set it to fr_FR.UTF-8 as you suggest and now, my test is OK !

I wonder if Java uses the file.encoding system property (which is set by
the portion of $LC_ALL after the .) to convert bytes returned from the
filesystem into filenames and vice versa.

Yeah, that appears to be the case:

import java.io.*;

public class FileEncodingTest
{
    public static void main(String[] args)
        throws Exception
    {
        System.out.println("Using file.encoding="
            + System.getProperty("file.encoding"));

        File file = new File("\u03c0"); // That's a lowercase Greek pi
        Writer out = new FileWriter(file);
        out.write("A test file\n");
        out.close();

        file = new File(".");

        File[] files = file.listFiles();

        for(int i=0; i<files.length; ++i)
        {
            file = files[i];

            System.out.print(file.getName());
            System.out.print("\tunicode: ");

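            // "UnicodeBigUnmarked" is Java's legacy name for UTF-16BE
            // without a byte-order mark
            byte[] bytes = file.getName().getBytes("UnicodeBigUnmarked");

            for(int j=0; j<bytes.length; ++j)
            {
                // Mask to 0..255: toHexString on a negative byte would
                // otherwise print sign-extended output such as ffffffc0
                String hex = Integer.toHexString(bytes[j] & 0xff);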
                if(1 == hex.length())
                    System.out.print("0");
                System.out.print(hex);
                System.out.print(" ");
            }

            System.out.println();
        }
    }
}

Output on my system:

$ java FileEncodingTest
Using file.encoding=ANSI_X3.4-1968
FileEncodingTest.class  unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 63 00
6c 00 61 00 73 00 73
FileEncodingTest.java   unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 6a 00
61 00 76 00 61
?       unicode: 00 3f

$ LC_ALL=en_US.UTF-8 java FileEncodingTest
Using file.encoding=UTF-8
FileEncodingTest.class  unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 63 00
6c 00 61 00 73 00 73
FileEncodingTest.java   unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 6a 00
61 00 76 00 61
?       unicode: 00 3f
π       unicode: 03 c0  (/this correctly emitted the glyph for pi/)

Then, for good measure:

$ java FileEncodingTest
Using file.encoding=ANSI_X3.4-1968
FileEncodingTest.class  unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 63 00
6c 00 61 00 73 00 73
FileEncodingTest.java   unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 6a 00
61 00 76 00 61
?       unicode: 00 3f
??      unicode: ff fd ff fd (/this did not/)

So, when running in ANSI_X3.4-1968 mode (that is, US-ASCII), Java takes
the code point for pi (0x03c0) and destroys it: the file ends up with a
one-character name, shown above as the two bytes 00 3f. 0x3f is the
ASCII question mark, which is what Java's charset encoders substitute
for a character the target encoding cannot represent, so this looks like
plain character replacement rather than broken sign-extension or
anything like that.
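
The 0x003f result can be reproduced without touching the filesystem.
Here is a minimal sketch, assuming the JVM encodes the filename with the
platform charset (US-ASCII here) the same way String.getBytes does. The
class and method names are made up for illustration, and
StandardCharsets needs Java 7 or later:

```java
import java.nio.charset.StandardCharsets;

public class AsciiEncodeDemo
{
    // Encode a string as US-ASCII and return the bytes as
    // space-separated two-digit hex.
    public static String encodeToHex(String s)
    {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        StringBuilder sb = new StringBuilder();
        for(int i=0; i<bytes.length; ++i)
        {
            if(i > 0)
                sb.append(' ');
            // Mask with 0xff so bytes >= 0x80 are not sign-extended
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        // pi has no US-ASCII representation, so it gets replaced
        System.out.println(encodeToHex("\u03c0"));
        // plain ASCII passes through unchanged
        System.out.println(encodeToHex("A"));
    }
}
```

This prints 3f for pi and 41 for "A": String.getBytes replaces an
unmappable character with the charset's replacement byte, which for
US-ASCII is the question mark (0x3f).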

When running in UTF-8 mode, the correct code point is used for the
filename and is read back correctly using listFiles.

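When running again in ANSI mode, the original (incorrect) filename is
(predictably) read back in the same way as before, but the filename with
the correct code point is again garbled (0x03c0 -> 0xfffd 0xfffd). Pi is
two bytes in UTF-8 (0xcf 0x80), neither of which is valid ASCII, so each
one comes back as the replacement character U+FFFD.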
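
The read-back direction can be sketched the same way. This assumes the
JVM decodes the raw filename bytes with the platform charset just as
new String(bytes, charset) would. The class and method names are
hypothetical, and StandardCharsets needs Java 7 or later:

```java
import java.nio.charset.StandardCharsets;

public class AsciiDecodeDemo
{
    // Decode raw bytes as US-ASCII and return each resulting char
    // as space-separated four-digit hex.
    public static String decodeToHex(byte[] raw)
    {
        String decoded = new String(raw, StandardCharsets.US_ASCII);
        StringBuilder sb = new StringBuilder();
        for(int i=0; i<decoded.length(); ++i)
        {
            if(i > 0)
                sb.append(' ');
            sb.append(String.format("%04x", (int)decoded.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        // The name as a UTF-8 filesystem would store it: cf 80
        byte[] utf8Pi = "\u03c0".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeToHex(utf8Pi));
    }
}
```

This prints fffd fffd: each of the two UTF-8 bytes is invalid as ASCII
and is decoded to its own U+FFFD.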

Somebody needs to write a virus that just converts everything to UTF-8
so we can be done with it.

-chris