tomcat-users mailing list archives

From André Warnier
Subject Re: Create FileInputStream in servlet from remote file with accentuated character name
Date Thu, 24 Sep 2009 21:26:54 GMT
Sylvie Perrin wrote:
> Christopher, André,
> Christopher Schultz wrote:
>>> And (just to anticipate the next issue), Sylvie, does your program
>>> actually need to read the content of the file and do something with that
>>> content ?
>> Yeah, remember to use a Reader and specify the character encoding.
> Yes, my program needs to do something with the content of the files in the
> shared Windows directory.
> Actually, the main action is to parse each file and read its content
> through an "InputStreamReader(new FileInputStream(file))".
> According to what Christopher says, I always need to specify the
> character encoding, i.e. use "InputStreamReader(new
> FileInputStream(file), encoding)".
If you know that all the files dropped there will be UTF-8 encoded, then 
specify UTF-8 as the encoding.
The problem is that, if you do not control who puts files there or how,
then at some point you may encounter a file whose content is encoded in,
say, iso-8859-1 instead of UTF-8.  In that case, your InputStreamReader
will run into bytes that are not valid UTF-8.  Built from a charset name,
it silently replaces them with U+FFFD; it only throws an exception if you
give it a CharsetDecoder set up to report malformed input.
You have to be prepared to deal with that.
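
As a rough illustration of "being prepared", here is a minimal sketch of a
reader that refuses invalid input instead of silently replacing it.  The
class and method names (StrictReaderSketch, openStrict, readAll) are made up
for this example, and it assumes Java 7 or later for StandardCharsets and
try-with-resources; adapt it to whatever your application already does.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Tomcat or the JDK.
class StrictReaderSketch {

    // Wrap the stream in a reader whose decoder reports bad input
    // (the default for a charset-name reader is to substitute U+FFFD).
    static Reader openStrict(InputStream in, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return new InputStreamReader(in, decoder);
    }

    // Read the whole stream as text; throws CharacterCodingException
    // (a subclass of IOException) if the bytes do not match the charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new BufferedReader(openStrict(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "café" encoded as iso-8859-1: the byte 0xE9 is not valid UTF-8 here.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        try {
            readAll(new ByteArrayInputStream(latin1), StandardCharsets.UTF_8);
            System.out.println("decoded as UTF-8");
        } catch (CharacterCodingException e) {
            System.out.println("content is not valid UTF-8: " + e);
        }
    }
}
```

In the servlet you would pass "new FileInputStream(file)" instead of the
ByteArrayInputStream, and decide in the catch block what to do with the
offending file (reject it, log it, or retry with another encoding).  Note
that retrying with iso-8859-1 never fails, since every byte is valid in that
encoding, so a successful retry does not prove the guess was right.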

The general point of all this is: until the whole computing world has
agreed to use Unicode/UTF-8 encoding everywhere (in directories, in text
files, in URLs, in program source code, ...), dealing with directory
entries and text files whose encoding is not known in advance is messy.
Without additional constraints on the clients, or additional information
provided separately, there is no 100% sure way to determine what you are
going to get.

If, as you indicate above, you are being asked to "parse" these files,
then I suppose that they must have some pre-defined form.  Does that
form also impose a given character set and encoding?  If not yet, I
strongly suggest that you try to add this to the requirements, because
otherwise the application will be unreliable.  Not because your programs
would be bad, but because it is simply impossible to be 100% reliable in
such cases.
