tomcat-users mailing list archives

From André Warnier ...@ice-sa.com>
Subject Re: Tomcat default encoding character ? Dfile.encoding option mean ?
Date Wed, 08 Oct 2008 00:37:15 GMT
albrecht andrzejewski wrote:
> I searched the ML archives and found some useful posts.
> 
> I've almost solved my problem: I can now display accented characters
> (é è à) using
>   request.setCharacterEncoding("UTF-8");
>   response.setCharacterEncoding("UTF-8");
> 
> It seems that the default charset for Tomcat is ISO-8859-1.
> The j2ee javadoc says:
> 
> "If no charset is specified, ISO-8859-1 will be used."
> 
> I was pretty sure that Tomcat handles UTF-8 by default, but that's not
> the case... at least for HttpServletResponse objects. Anyway, do you know
> if it's possible to set up a default charset for the whole Tomcat
> response, instead of calling these two methods every time a request
> reaches the servlet?
> 
> I tried to define the CATALINA_OPTS, but perhaps the file encoding is 
> different from the request/response encoding.
> CATALINA_OPTS="-Dfile.encoding=UTF-8"
> export LC_ALL CATALINA_OPTS
> 

Take the following with caution, because I do not really know the 
underlying reason in Tomcat :

I have found that setting the LC_CTYPE environment variable to a UTF-8 
"locale" (or conversely, to an ISO-8859-1 locale) prior to starting Tomcat 
influences the way in which *some* servlets read request bodies 
and/or encode responses.
You can do this in the startup.sh script, or probably more correctly in 
the setenv.sh script, in the Tomcat/bin directory (that is, if your 
Tomcat is "the" canonical distribution; if your Tomcat comes from a 
pre-packaged version, it may not use these scripts for startup).
Make sure to use a valid and installed locale. Run
  locale -a
then choose from the list an installed locale which fits and has "utf8" 
in its name, and set it in the script before Tomcat starts, for example:
  LC_CTYPE="en_US.utf8"; export LC_CTYPE

(in the above, I am assuming Unix/Linux; under Windows it may not be 
feasible).

One reason to be careful with this anyway is that it may have 
unexpected consequences for other servlets.
I believe this happens when the servlet itself does not explicitly 
specify the encoding it uses for reading the request body or writing 
the response, and the JVM then falls back to the locale setting of the 
process that runs it and Tomcat.
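
To see what such a fallback looks like, here is a minimal sketch that 
prints the JVM's default charset and compares a conversion that names no 
charset with ones that do. The class and method names are made up for 
the illustration. Note that on recent JDKs (18 and later) the default 
charset is UTF-8 regardless of the locale, so the locale effect 
described above would mainly show on older JVMs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Tomcat or the servlet API.
class DefaultCharsetProbe {
    // "e with acute accent", written as a Unicode escape so the example
    // does not depend on the encoding of this source file.
    static final String E_ACUTE = "\u00e9";

    // No charset named: the JVM default charset decides the result.
    static int implicitByteCount() {
        return E_ACUTE.getBytes().length;
    }

    // Charset named explicitly: the result is the same on every machine.
    static int explicitByteCount(Charset cs) {
        return E_ACUTE.getBytes(cs).length;
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset().name());
        System.out.println("implicit bytes  = " + implicitByteCount());
        System.out.println("UTF-8 bytes     = " + explicitByteCount(StandardCharsets.UTF_8));
        System.out.println("Latin-1 bytes   = " + explicitByteCount(StandardCharsets.ISO_8859_1));
    }
}
```

Running it once with and once without a UTF-8 LC_CTYPE should show 
whether the locale changes anything on a given JVM.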

In other words, in my opinion your solution above of setting this 
explicitly in your servlet is the better one.

Also make sure that all the html pages that you serve contain a tag like
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

If your html pages contain <form> tags, and you would like the browser 
to be nice and send you proper UTF-8 encoded form values when posting 
form content, then add the following attributes to them, to try and 
convince the browser to do the right thing:
<form .. method="POST" enctype="multipart/form-data" 
accept-charset="UTF-8">

And then, if you design and edit your html pages yourself, make sure 
that you use an editor that supports UTF-8, and save your pages as such.

And then, verify at the browser level (for example with Firefox and the 
LiveHttpHeaders extension) that the browser is actually receiving an 
HTTP header like
Content-Type: text/html; charset=UTF-8
with every response from your server.

Paranoia: since you cannot trust the user or his browser anyway, you 
may still want to add to your <form>s a hidden input field containing a 
fixed value: a known string with some accented characters. Then, in 
your application, you can check whether you really received that string 
as expected. If not, then something unexpected happened with the form 
encoding, and you should reject the data. Something like this (the field 
name is only an example, but the field needs one in order to be 
submitted at all):
<input type="hidden" name="charset_check" value="ÁlélÜìÄ">
This string has a different length in bytes depending on whether it is 
encoded as UTF-8 or ISO-8859-1 (an "é" is 1 byte in ISO-8859-1, but 2 
bytes in UTF-8).
That is not really paranoia, it's experience.
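
As a rough sketch of that check, the following compares the byte lengths 
of the known string in both encodings and simulates what the server sees 
when UTF-8 bytes are decoded as ISO-8859-1. The class and method names 
are invented for the example; in a real servlet the received value would 
come from the request parameter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any servlet API.
class EncodingSentinel {
    // The known string from the hidden field, written with Unicode escapes
    // so the example does not depend on the encoding of this source file.
    static final String SENTINEL = "\u00c1l\u00e9l\u00dc\u00ec\u00c4";

    // Number of bytes the string occupies in the given charset.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    // Simulates the failure case: bytes sent as UTF-8, decoded as ISO-8859-1.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // The actual check: did the field arrive exactly as it was sent?
    static boolean sentinelIntact(String received) {
        return SENTINEL.equals(received);
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1 bytes: " + byteLength(SENTINEL, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 bytes:      " + byteLength(SENTINEL, StandardCharsets.UTF_8));
        System.out.println("intact as sent:   " + sentinelIntact(SENTINEL));
        System.out.println("intact misdecoded:" + sentinelIntact(misdecode(SENTINEL)));
    }
}
```

Comparing the whole string rather than just its length also catches 
cases where the length happens to match but the characters do not.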

That was the practical bit. If you want more general theorising, keep reading.

In general, for historical reasons mostly, the default charset/encoding 
for HTML and HTTP is ISO-8859-1 (latin-1).
This is not always clear in all RFCs that contribute to various aspects 
of web applications however, so there is a certain amount of confusion. 
  For example, the RFCs concerning HTML are quite clear (iso-8859-1 by 
default), while the RFCs concerning HTTP URIs are more vague or mutually 
contradictory.
In any case, it is (unfortunately) not Unicode/UTF-8 everywhere by 
default, despite the hopes and beliefs of some web developers.

The fact that the internal Java charset is Unicode, and that its 
external encoding is often UTF-8 in practice (the JVM default actually 
follows the platform), tends to lull some Java/Tomcat developers into 
the false belief that URLs are also UTF-8 by default, while they are not 
(as far as I can determine, they are encoding-neutral).
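
To illustrate why the URL encoding matters, this small sketch 
percent-encodes the same "é" under both charsets with the JDK's 
URLEncoder and URLDecoder, and shows what comes out when the two sides 
disagree. The class and wrapper method names are made up for the 
example, and these JDK classes implement form-style encoding, which is 
only an approximation of what a browser does with a URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class wrapping the JDK's URLEncoder/URLDecoder.
class UrlCharsetDemo {
    static String encode(String s, String charset) {
        try {
            return URLEncoder.encode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    static String decode(String s, String charset) {
        try {
            return URLDecoder.decode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // "e with acute accent"
        // Same character, two different escaped forms.
        System.out.println(encode(eAcute, "UTF-8"));       // %C3%A9
        System.out.println(encode(eAcute, "ISO-8859-1"));  // %E9
        // Sender used UTF-8, receiver assumes ISO-8859-1: two characters
        // come out instead of one.
        System.out.println(decode("%C3%A9", "ISO-8859-1").length()); // 2
    }
}
```

Nothing in the escaped form says which charset produced it, so both 
ends have to agree on it by some other means.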

Some people also believe that UTF-8 and ISO-8859-1 are identical anyway 
for the first 256 Unicode code points, so it doesn't really matter.  But 
this is also incorrect (the byte encodings agree only for the first 128 
codes, i.e. ASCII), and it does matter for anyone trying to build an 
application that is not purely English-speaking, as you have noticed.
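
That claim is easy to check. The sketch below (class and method names 
are made up for the example) encodes each of the first 256 code points 
in both charsets and counts how many produce the same bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for the comparison.
class Latin1VersusUtf8 {
    // True if the code point is encoded to the same bytes in both charsets.
    static boolean sameBytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Arrays.equals(
                s.getBytes(StandardCharsets.ISO_8859_1),
                s.getBytes(StandardCharsets.UTF_8));
    }

    // Counts the code points in 0..255 whose encodings agree.
    static int countAgreeing() {
        int count = 0;
        for (int cp = 0; cp < 256; cp++) {
            if (sameBytes(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countAgreeing()); // prints 128
    }
}
```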

And finally, there seems to be some confusion between a parameter that 
specifies a default encoding for Tomcat's internal processing of URIs 
and the request body or response body encoding.  There is also a 
parameter, I believe, that specifies something like "use the body 
encoding for the URL also" or vice-versa.  In Tomcat these are, as far 
as I know, the URIEncoding and useBodyEncodingForURI attributes of the 
<Connector> element in server.xml.

Add to this, that users can set up their browser in various ways, that 
they may have various keyboards and operating systems, that some 
browsers disregard what the server says about documents anyway and think 
they are smarter, and you get the situation that exists currently on the 
web, where half the time I cannot enter my first name in a web form and 
see it returned to me correctly in a response or an email.  And I guess 
you may not be faring much better with your last name.

None of this makes things any simpler, but...

The good news is that it appears to be improving over time, with correct 
UTF-8 support now in all browsers, and a tendency by web developers to 
specify UTF-8 explicitly wherever it's needed.
Which is in many places, if you really want to stack the odds in your favour.






