tomcat-users mailing list archives

From André Warnier ...@ice-sa.com>
Subject Re: Create FileInputStream in servlet from remote file with accentuated character name
Date Tue, 22 Sep 2009 08:00:55 GMT
Christopher Schultz wrote:
...
> 
> What is the source of that file name? Is it hard-coded into your Java
> code? If so, how? Did you just type "fichié.txt" into your .java file,
> or did you use "\uxyz" syntax to specify the UNICODE character you intended?
> 
> If you are reading the filename from a remote client, then all the
> request URI encodings and all that stuff are definitely relevant (in
> spite of my previous statements to the contrary).
> 
...
> Honestly, I think the above should not be a problem. 
...
Christopher,

what I am trying to say is that such matters are horrible, because 
*everything* matters.

One cannot even be sure that the logfile message, as seen by the user 
and as pasted in the email to the list, and as further seen by the 
reader on this list, is really how the message is physically stored in 
the logfile.  That's because in-between, there can be umpteen layers of 
decoding/encoding which can make matters really confusing.
(Even the encoding used by the process which writes the logfile may 
matter, because "fichié.txt" may already have been re-encoded right there.)
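
A minimal sketch of one way to take the display layers out of the 
picture: write the name to the log using ASCII only, so that nothing 
in-between can re-encode it.  The class and method names are made up 
for this illustration; they do not come from the thread.

```java
// Hypothetical helper: renders a string so that every character outside
// printable ASCII appears as a backslash-u escape with four hex digits.
class AsciiSafeName {
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x20 && c < 0x7f) {
                sb.append(c);                                  // printable ASCII: keep as-is
            } else {
                sb.append(String.format("\\u%04x", (int) c));  // anything else: escape it
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The accented character is itself written as an escape, so this
        // source file contains ASCII only.
        System.out.println(escape("fichi\u00e9.txt"));
    }
}
```

A log line produced this way looks the same in any terminal, mail 
client or archive, whatever charset each of them assumes.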

Your note about making sure, in the source code of the program, that the 
filename is really made out of the bytes which the OP thinks it is made 
of, is a good example. If, to create this program source, one uses an 
editor which is set to save its files in the iso-latin-1 charset, then 
"fichié.txt" will be saved, in the program source, as a string of 10 
bytes.  Conversely, if one uses an editor set to save its files in 
Unicode/UTF-8, then this same string will be saved as 11 bytes (the "é" 
occupying 2 bytes).
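
The 10-versus-11 count can be checked with a short sketch like the 
following.  It assumes Java; the class and method names are invented 
for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FilenameByteCount {
    // U+00E9 (e with acute accent) is written as an escape so that the
    // result does not depend on how this source file itself is saved.
    static final String NAME = "fichi\u00e9.txt";

    // Number of bytes needed to store the string in the given charset.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        System.out.println(byteLength(NAME, StandardCharsets.ISO_8859_1)); // prints 10
        System.out.println(byteLength(NAME, StandardCharsets.UTF_8));      // prints 11
    }
}
```
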
Then comes the compiler.
I don't know how a Java compiler handles source code saved as an 
iso-8859-1 encoded file versus as a UTF-8 encoded file. How does it 
tell the difference? Does it make assumptions based on the locale it is 
running under?
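
One way to sidestep that question, in line with Christopher's "\uxyz" 
remark, is to keep the source file pure ASCII and spell the accented 
character as a Unicode escape.  (javac also accepts an -encoding option 
naming the charset of the source files; what it assumes when that 
option is absent has depended on the platform and the JDK version, so 
it is safer not to rely on it.)  A small sketch, with a made-up class 
name:

```java
class SourceEncodingIndependent {
    // Only ASCII characters appear in this file, so the compiler's idea
    // of the source encoding cannot change what this constant contains.
    static final String NAME = "fichi\u00e9.txt";

    public static void main(String[] args) {
        System.out.println(NAME.length());                       // prints 10
        System.out.println(Integer.toHexString(NAME.charAt(5))); // prints e9
    }
}
```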

About the creation and subsequent "finding" of a file:
Generally speaking, filesystems are "encoding agnostic", in the precise 
sense that:
- if on a given platform and with a given programming language, you 
arrange for a string variable S to contain a precise series of bytes 
(for example, the UTF-8 encoding of the string "fichié.txt", 11 bytes long)
- if you then use that variable as the name of a file which you create 
on disk
- then no matter where this file directory ultimately resides, the name 
of the file in it will generally be these same exact 11 bytes.
- if you then, from the same platform and using the same programming 
language, use this same variable S as the name of a file which you try 
to open, it will work.
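
The strict case above can be sketched as follows, assuming Java and a 
scratch directory.  The class and method names are invented for the 
example.  Note that in Java the variable holds characters rather than 
bytes, and the JVM converts them to bytes for the filesystem; the catch 
block covers the case where the JVM's filename encoding cannot express 
the accented character at all.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;

class SameNameRoundTrip {
    // Creates a file called `name` inside `dir`, then reopens it through
    // the very same String and returns what was read back.
    static String roundTrip(Path dir, String name) throws IOException {
        Files.write(dir.resolve(name), "hello".getBytes(StandardCharsets.US_ASCII));
        byte[] readBack = Files.readAllBytes(dir.resolve(name));
        return new String(readBack, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("roundtrip");
        try {
            System.out.println(roundTrip(dir, "fichi\u00e9.txt"));
        } catch (InvalidPathException e) {
            // The name cannot be converted to bytes under this JVM's settings.
            System.out.println("name not representable here: " + e.getMessage());
        }
    }
}
```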

However, as soon as you deviate from the strict case above, what looks 
to you like "fichié.txt" /may/ not be the same series of bytes anymore, 
and that's where the problems start.

What the filename will "look" like is another matter, however, depending 
on what you use to display it and from where you do it.

In the case of Sylvie (and I am talking here about the final issue she 
is trying to handle, not just about the test case)

- presumably, some (other) users and/or applications, running on some 
(other) platform and using some (other) tools, are creating files inside 
of a Windows host's directory.
One item of interest here would be to know how these files are created, 
and if that process is consistent (meaning, are these files always 
created by the same programs, running always on the same platform, using 
the same encoding etc..).  That is to make sure that when a file named 
"fichié.txt" is created there by whatever, it will always be created the 
same way, with a name of either 10 or 11 bytes (it does not matter 
which, just that it be consistent).

- then some program created by Sylvie has to access that directory, 
and pick up files from there.  So this program may have to "know" how a 
filename "fichié.txt" will be encoded in that directory (either as 10 or 
11 bytes). It also does not matter which, as long as Sylvie's program 
has a way to consistently "spell" this name correctly.

The problem is generally unsolvable, if the original entry in the 
directory can be created in several ways, because there are multiple 
agents capable of creating it, and these agents use inconsistent encodings.

The issue can be simpler, if Sylvie's program just opens the directory, 
reads the filenames that it finds there (whatever their encoding is), 
into some variable, and then just uses this variable as the filename to 
open the file and that's it.
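
A minimal sketch of that simpler approach, assuming Java's NIO file 
API; the class and method names are invented for the example.  The 
program never spells a filename itself: it only opens what the 
directory listing hands back.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class OpenWhatTheDirectoryReturns {
    // Reads every regular file in `dir`, using the Path objects exactly as
    // the directory listing returned them; no name is typed or rebuilt.
    static List<byte[]> readAll(Path dir) throws IOException {
        List<byte[]> contents = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    contents.add(Files.readAllBytes(entry));
                }
            }
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: OpenWhatTheDirectoryReturns <directory>");
            return;
        }
        System.out.println(readAll(Paths.get(args[0])).size() + " file(s) read");
    }
}
```
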
But if, in Sylvie's program, the filename itself has to be compared to 
some pre-defined other string stored in the program, and some action 
taken or not depending on whether the two are considered equal, then 
there may be a problem.
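
A small illustration of how such a comparison can go wrong, assuming 
Java; the names are made up for the example.  The same 11 bytes match 
or fail to match the pre-defined string depending on which encoding the 
program assumes when it turns them into characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NameComparison {
    // The pre-defined string stored in the program.
    static final String EXPECTED = "fichi\u00e9.txt";

    // Decodes the raw name under the assumed charset, then compares.
    static boolean matches(byte[] rawName, Charset assumed) {
        return EXPECTED.equals(new String(rawName, assumed));
    }

    public static void main(String[] args) {
        // Pretend the directory entry was created as UTF-8 (11 bytes).
        byte[] onDisk = EXPECTED.getBytes(StandardCharsets.UTF_8);
        System.out.println(matches(onDisk, StandardCharsets.UTF_8));      // prints true
        System.out.println(matches(onDisk, StandardCharsets.ISO_8859_1)); // prints false
    }
}
```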

Yet another aspect to consider is whether Sylvie is really testing the 
right thing.
For instance, when Sylvie runs her Java test program, she does this from 
inside a Linux session, which is set for a specific "locale".
However, the Tomcat server may well be started under a different locale 
setting, and this may have an impact as to how each one of them looks at 
the filename "fichié.txt".
(And also, as you mention, it depends on how this string "fichié.txt" 
gets /into/ the program.)
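
One way to check whether the test program and Tomcat really see the 
same settings is to print them from both, for instance from the 
command-line test and from a servlet.  A sketch, with an invented class 
name; note that "sun.jnu.encoding" is an internal, unsupported property 
and may be absent on some JVMs.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class JvmEncodingReport {
    // Collects the settings most likely to differ between two JVMs
    // started from different environments.
    static String report() {
        return "default charset = " + Charset.defaultCharset()
             + ", file.encoding = " + System.getProperty("file.encoding")
             + ", sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding")
             + ", locale = " + Locale.getDefault();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

If the two outputs differ, the test program is not testing the same 
thing as the webapp.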

Then of course, after the above trivial matter of the filename is 
resolved, one may have to tackle the matter of how the file contents are 
encoded.
:-)
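
For the contents, the same reasoning applies: a sketch, assuming Java, 
of naming the charset explicitly when reading instead of relying on the 
JVM default.  The class and method names are invented for the example.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadWithExplicitCharset {
    // Reads the whole file and decodes it with the charset the caller names.
    static String readText(Path file, Charset cs) throws IOException {
        return new String(Files.readAllBytes(file), cs);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("contents", ".txt");
        String text = "caf\u00e9";
        Files.write(file, text.getBytes(StandardCharsets.ISO_8859_1));

        System.out.println(text.equals(readText(file, StandardCharsets.ISO_8859_1))); // prints true
        System.out.println(text.equals(readText(file, StandardCharsets.UTF_8)));      // prints false
    }
}
```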



