tomcat-users mailing list archives

From André Warnier
Subject Re: request.setCharacterEncoding() && request.getParameter()
Date Wed, 08 Jul 2009 16:14:14 GMT
Daniel Henrique Alves Lima wrote:
> 	IE is the best :-)
> "Note: The accept-charset attribute does not work properly in Internet
> Explorer. If accept-charset='ISO-8859-1', IE will send data encoded as
> 'Windows-1252'."
That is only one of the issues (browser inconsistencies).

If you want to really tackle this complex issue, you need to be 
systematic, make sure you understand the bits and pieces, and do 
everything right.
A short overview :

1) choose Unicode/UTF-8 as your charset/encoding, for *everything*. 
Don't try to mix and match, or you'll get in trouble. Promise.

Applying #1 above :

2) find out the available "locales" on the Linux host where you run 
this, with the command
locale -a | more
Pick one locale that has "utf8" in the name, and note its name.
In the system script that starts Tomcat, add
export LC_ALL="pt_PT.utf8@euro"
(or whichever locale you have chosen)
That sets the "system locale" for the JVM that runs Tomcat, and is a way 
to make it independent from whatever may be the system's configured 
"default locale".
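
To verify that the export from step 2 actually reached the JVM, a small 
check along these lines can help. This is only a sketch: the class name 
DefaultCharsetCheck is made up for illustration, and on recent JVMs 
(JDK 18 and later) the default charset is UTF-8 regardless of the 
locale, so the check is mostly useful on older ones.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper class, not part of Tomcat or of the original post.
final class DefaultCharsetCheck {

    // Name of the charset the JVM uses when code does not specify one.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        System.out.println("Default locale:  " + Locale.getDefault());
    }
}
```

Run it with the same environment as the Tomcat start script. If the 
charset shown is not UTF-8, the locale setting probably did not take 
effect for that process.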

3) All your html pages should have a declaration like :
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />

4) All your html <form> tags should have an attribute :
accept-charset="UTF-8"

5) a URL is in no particular charset.  A URL is *bytes*.
Any byte in a URL that is not (generally speaking) an ASCII letter or 
digit (a-zA-Z0-9) or one of a few safe punctuation characters will be 
encoded as %xy, where xy is the hexadecimal value of this byte.
After decoding these %xy things, the result is again bytes, and that's 
how your application sees it.

6) In your application, you can decide to interpret this series of 
bytes, as a string in the UTF-8 encoding, and decode it as such into 
Unicode *characters*.
Forget about any parameters to specify the charset of URLs, they only 
confuse things totally.
The only way to know the underlying encoding is to know for sure that 
all URLs that hit your server come from a known source whose encoding 
you controlled.
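
Points 5 and 6 can be illustrated with a small sketch: the same %xy 
sequence gives different characters depending on which charset you 
choose to interpret the decoded bytes in. The class name UrlBytesDemo 
and the sample string are invented for this example. 
java.net.URLDecoder is the standard JDK class; note that it follows 
form-encoding rules and also turns "+" into a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, not part of the original post.
final class UrlBytesDemo {

    // Percent-decodes the input into bytes, then interprets those bytes
    // in the charset named by charsetName.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String raw = "caf%C3%A9"; // the two bytes 0xC3 0xA9 after "caf"
        System.out.println(decode(raw, "UTF-8"));      // one accented letter
        System.out.println(decode(raw, "ISO-8859-1")); // two unrelated characters
    }
}
```

Interpreted as UTF-8 the result is "café". Interpreted as ISO-8859-1 
it is "cafÃ©". The bytes are identical in both cases, only the 
interpretation differs.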

7) When submitting the values of the <input> tags of a form, browsers 
will generally respect the basic encoding of the html page in which the 
form was included, and (usually) also the "accept-charset" attribute.
By specifying both, you almost always win, as long as the submitted form 
comes from your application, and has the right encoding.

8) In theory, you should also make sure that all responses sent by your 
server to a browser, if they are html pages, contain the correct HTTP 
header :
Content-type: text/html; charset=UTF-8
That, you can check with a browser add-on such as
- Live HTTP Headers for Firefox
- Fiddler2 for IE
and examine what goes out and what comes in.
You can also use Wireshark.
The good news is that most webservers do this correctly.
The bad news is that IE usually ignores this header, and makes its own 
decision as to what the content is.  Newer IE versions may be better.

9) Java's internal charset is Unicode (strings are held as UTF-16).
So when you do request.getParameter(), you will always get what Java 
considers to be the proper Unicode translation of the bytes in which 
the parameter came in.
The problem is to not let Java get confused about what it receives from 
the browser.  By doing all the above, you minimise the chances that it 
will be confused.
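
What "confused" looks like in practice can be sketched as follows. It 
assumes the browser sent UTF-8 bytes and the container decoded them as 
ISO-8859-1, which is a common default when no encoding has been set. 
The class and method names are made up for the illustration. In a 
servlet, the usual way to avoid the problem for POST bodies is to call 
request.setCharacterEncoding("UTF-8") before the first 
request.getParameter().

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original post.
final class MojibakeDemo {

    // Simulates the mistake: UTF-8 bytes on the wire, decoded as ISO-8859-1.
    static String misdecode(String original) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Undoes the mistake: ISO-8859-1 maps every byte to exactly one
    // character, so the original bytes can be recovered and decoded again.
    static String repair(String garbled) {
        byte[] wire = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sent = "\u00e9";            // e with acute accent
        String seen = misdecode(sent);     // two characters instead of one
        System.out.println(seen.length()); // prints 2
        System.out.println(repair(seen).equals(sent)); // prints true
    }
}
```

The repair step is a workaround for data that has already been 
misread. It only works when the wrong charset was ISO-8859-1, so 
getting the decoding right in the first place is preferable.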

10) If you want to really make sure, include in all your forms some 
hidden input value, containing a known string with "accented" characters 
(áàéèÜÖ and such).
Then, before you process any other parameter in your webapp, check if 
that string matches one that you have defined in your servlet.
If it does not, then something is wrong.
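
A minimal sketch of the check in point 10 could look like this. The 
class name EncodingSentinel and the idea of a hidden field called 
"encodingCheck" are assumptions for the example. The received value 
would come from something like request.getParameter("encodingCheck") 
in your servlet.

```java
// Hypothetical helper class, not part of the original post.
final class EncodingSentinel {

    // The known string with accented characters (a-acute, a-grave, e-acute,
    // e-grave, U-umlaut, O-umlaut), written as Unicode escapes so that the
    // check does not depend on the encoding of this source file.
    static final String EXPECTED = "\u00e1\u00e0\u00e9\u00e8\u00dc\u00d6";

    // True only if the hidden field arrived exactly as it was sent.
    // A missing (null) value counts as a failure.
    static boolean looksCorrect(String received) {
        return EXPECTED.equals(received);
    }

    public static void main(String[] args) {
        System.out.println(looksCorrect(EXPECTED));       // prints true
        System.out.println(looksCorrect("\u00c3\u00a1")); // prints false
    }
}
```

If looksCorrect() returns false, it is probably better to reject or 
log the request than to store parameters that may already be garbled.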
