tomcat-users mailing list archives

From Christopher Schultz <ch...@christopherschultz.net>
Subject Re: Create FileInputStream in servlet from remote file with accentuated character name
Date Tue, 22 Sep 2009 14:57:48 GMT
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

André,

On 9/22/2009 4:00 AM, André Warnier wrote:
> what I am trying to say is that such matters are horrible, because
> *everything* matters.

Eh.. well, yeah. :)

> Your note about making sure, in the source code of the program, that the
> filename is really made out of the bytes which the OP thinks it is made
> of, is a good example. If, to create this program source, one uses an
> editor which is set to save its files in the iso-latin-1 charset, then
> "fichié.txt" will be saved, in the program source, as a string of 10
> bytes.  Conversely, if one uses an editor set to save its files in
> Unicode/UTF-8, then this same string will be saved as 11 bytes (the "é"
> occupying 2 bytes).
> Then comes the compiler..
> I don't know how a Java compiler handles source code respectively saved
> as an iso-8859-1 encoded file, or as a UTF-8 encoded file. How does it
> tell the difference ? does it make assumptions based on the locale it is
> running under ?

javac is documented to use the platform default encoding (for /Java/),
which may not be the default encoding of your editor. :(

http://java.sun.com/javase/6/docs/technotes/tools/windows/javac.html
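
To see what "platform default encoding" means on a given machine, something
like the following may help. It is only a sketch: the class and method names
are made up for illustration, and it reports the default of the JVM that runs
it, which I am assuming matches what javac picked up in the same environment
(newer JDKs may behave differently). The javac option -encoding overrides
the default.

```java
import java.nio.charset.Charset;

public class DefaultEncodingCheck {
    // Name of the charset this JVM treats as its default.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + defaultCharsetName());
        // The system property usually agrees with the line above,
        // but that is not guaranteed on every JVM.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

If the printed name is not the encoding your editor saves in, compile with
an explicit javac -encoding value.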

Without any interference from me, my compiler chooses ANSI_X3.4-1968,
which is plain 7-bit US-ASCII (not even Latin-1), so any funny business
in there like "Thử nghiệm Tiếng Việt" isn't going to fly. It's always
best in Java source files to stick to plain ASCII and write any special
Unicode characters as \uXXXX escapes.
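
A minimal sketch of the escape approach, using the filename from this thread
(the class and method names are hypothetical, and StandardCharsets assumes
Java 7 or later). The source file is pure ASCII, so it compiles the same way
whatever encoding javac assumes, and it also shows the 10-versus-11-byte
difference André describes.

```java
import java.nio.charset.StandardCharsets;

public class FilenameBytes {
    // "fichié.txt" written with a Unicode escape (e-acute is U+00E9),
    // so this source file contains only ASCII characters.
    static final String NAME = "fichi\u00e9.txt";

    // Number of bytes the name occupies when encoded as ISO-8859-1.
    static int latin1Length() {
        return NAME.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Number of bytes the name occupies when encoded as UTF-8.
    static int utf8Length() {
        return NAME.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("chars:      " + NAME.length());   // 10
        System.out.println("ISO-8859-1: " + latin1Length());  // 10
        System.out.println("UTF-8:      " + utf8Length());    // 11
    }
}
```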

The OP won't cough up the source code, though, so we don't even know if
this is a source code problem or an HTTP-request-parameter
interpretation problem.

> One item of interest here would be to know how these files are created,
> and if that process is consistent (meaning, are these files always
> created by the same programs, running always on the same platform, using
> the same encoding etc..).  That is to make sure that when a file named
> "fichié.txt" is created there by whatever, it will always be created the
> same way, with a name of either 10 or 11 bytes (it does not matter
> which, just that it be consistent).

+1

> The problem is generally unsolvable, if the original entry in the
> directory can be created in several ways, because there are multiple
> agents capable of creating it, and these agents use inconsistent encodings.

Yup. Unless you read the directory entry from the filesystem and guess
at the right file (ha!), you might not get the one you want.
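
For what it's worth, the "guess" could look roughly like this. It is purely
a heuristic of my own for illustration, not anything Java or Tomcat
provides: collapse every run of non-ASCII characters to a single "?" and
compare what is left. The class and method names are hypothetical, and
more than one file can match.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class FilenameGuess {
    // Replace each run of non-ASCII characters with one '?', so that
    // differently decoded forms of the same name compare equal.
    static String skeleton(String name) {
        return name.replaceAll("[^\\x00-\\x7F]+", "?");
    }

    // Return every listed name whose ASCII skeleton matches the wanted name's.
    static List<String> candidates(String wanted, String[] listed) {
        List<String> hits = new ArrayList<String>();
        if (listed == null) {
            return hits; // File.list() returns null for a non-directory
        }
        String target = skeleton(wanted);
        for (String name : listed) {
            if (skeleton(name).equals(target)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Directory to scan: first argument, or the current directory.
        String[] listed = new File(args.length > 0 ? args[0] : ".").list();
        System.out.println(candidates("fichi\u00e9.txt", listed));
    }
}
```

A UTF-8 name mis-decoded as Latin-1 and a Latin-1 name mis-decoded as UTF-8
both collapse to "fichi?.txt" here, but so would any other file whose name
differs only in its non-ASCII characters, hence the "(ha!)".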

> However, the Tomcat server may well be started under a different locale
> setting, and this may have an impact as to how each one of them looks at
> the filename "fichié.txt".

Unfortunately, the Java API says nothing about the encoding used to read
and write filenames. :(

> Then of course, after the above trivial matter of the filename is
> resolved, one may have to tackle the matter of how the file contents are
> encoded.

At least the programmer has some measure of control over that.
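
That control amounts to naming the charset explicitly instead of relying on
the platform default. A small sketch, assuming the contents are UTF-8 (the
class and method names are hypothetical; try-with-resources and
StandardCharsets need Java 7 or later):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRead {
    // Read a whole file as text, decoding its bytes with the given charset.
    static String readAll(File file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new FileInputStream(file), charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Write a short string as UTF-8 to a temp file (ASCII file name),
    // read it back as UTF-8, and report whether it survived intact.
    static boolean roundTrip() {
        String text = "fichi\u00e9";
        try {
            File tmp = File.createTempFile("charset-demo", ".txt");
            tmp.deleteOnExit();
            try (Writer out = new OutputStreamWriter(new FileOutputStream(tmp), StandardCharsets.UTF_8)) {
                out.write(text);
            }
            return text.equals(readAll(tmp, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("round trip ok: " + roundTrip());
    }
}
```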

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkq45ewACgkQ9CaO5/Lv0PAVsQCgt9YnaEBJhRatVGgsUWjkmLlC
9yEAn03E+uM5bslLUZ1/sC4y3/3z1y0u
=pCP2
-----END PGP SIGNATURE-----
