tomcat-users mailing list archives

From André Warnier
Subject Re: request.setCharacterEncoding() && request.getParameter()
Date Wed, 08 Jul 2009 20:56:30 GMT
Daniel Henrique Alves Lima wrote:

On Wed, 2009-07-08 at 18:14 +0200, André Warnier wrote:
>> 6) In your application, you can decide to interpret this series of 
>> bytes, as a string in the UTF-8 encoding, and decode it as such into 
>> Unicode *characters*.
>> Forget about any parameters to specify the charset of URLs, they only 
>> confuse things totally.
>> The only way you know what was the underlying encoding, is when you know 
>> for sure that all URLs that will hit your server, come from a known 
>> source of which you controlled the encoding.
> ?
To use an example :

Suppose you give me the URL to your webapp, and it is

Suppose I use this URL, and add a query string, so that it arrives at 
your server as a GET request for

then you have absolutely no way, after URL-decoding the above into a 
series of bytes, to know which character set I actually used to compose 
that query string.

It /could be/ that the sequence %c4%20 that you see above is actually 
the UTF-8 encoding of a single Unicode character. (**)

But it could also be that in fact it is the two iso-8859-1 characters 
"Ä" and "space".
And it could also be that, together with the "x" which follows, it is 
the tri-byte encoding of the Klingon symbol for breakfast.(*)

In order to decide on an interpretation of that query string using a 
certain character set and encoding, you would have to know something 
about me and my browser, which on the WWW you don't know.
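
To make the ambiguity concrete, here is a small sketch in Java. It 
assumes the query string contains the bytes %c4%20 followed by "x", as 
discussed above; the class and method names are made up for this 
illustration. The same URL-decoded bytes give a different string 
depending on which charset the server chooses to assume.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; not part of Tomcat or of any application.
final class QueryDecodingDemo {

    // URL-decode the raw text, interpreting the resulting bytes in the given charset.
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Assumed fragment of a query string: the bytes 0xC4 0x20, then "x".
        String raw = "%c4%20x";

        // Read as iso-8859-1: "Ä", a space, then "x".
        String asLatin1 = decode(raw, "ISO-8859-1");

        // Read as UTF-8: 0xC4 starts a two-byte sequence, but 0x20 is not a
        // continuation byte, so the decoder cannot produce "Ä" here.
        String asUtf8 = decode(raw, "UTF-8");

        System.out.println("ISO-8859-1: " + asLatin1);
        System.out.println("UTF-8:      " + asUtf8);
        System.out.println("Same text?  " + asLatin1.equals(asUtf8));
    }
}
```

Nothing in the bytes themselves tells the server which of the two 
readings the sender intended.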

The only way you could /assume/ a certain character set and encoding 
would be if this request could only originate from a page that your 
application sent to my browser beforehand, in which you have done your 
best to ensure that any "click" results in a request with a known 
charset and encoding.
That's why all the previous details are important.

Note that some people variously assume that an HTTP URL is necessarily 
expressed in US-ASCII, or iso-latin-1, or UTF-8.
They are generally mistaken, as per

So, let me add an item to the previous shortlist :

11) In HTML <form> elements, always specify the attribute
This way, form input elements will be passed in the /body/ of the HTTP 
request (and not in the URL, as in my GET example above).
At least for the body of an HTTP request, the browser can, and /should/, 
send charset/encoding information allowing the server to know how the 
submitted parameters are encoded.
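
As a rough sketch of what the server side can do with that information, 
the following Java helper reads the charset parameter from a 
Content-Type header value and uses it to decode one form value. The 
class name, method names and the fallback argument are assumptions made 
for this illustration. In a real servlet you would normally call 
request.setCharacterEncoding() before the first request.getParameter() 
instead of decoding by hand.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper; a sketch, not how Tomcat itself does it.
final class FormBodyCharset {

    // Return the charset named in a Content-Type header value, or the
    // fallback when the header is missing or names no usable charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length())
                                   .replace("\"", "")
                                   .trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // URL-decode one raw form value using the given charset.
    static String decodeValue(String rawValue, Charset charset) {
        try {
            return URLDecoder.decode(rawValue, charset.name());
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note the fallback: when the browser sends no charset with the body, the 
server is back to guessing, which is why controlling the encoding of 
the page that contains the form still matters.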

There seems to be a recent /tendency/ for browsers to use UTF-8 for 
encoding request URLs, but it is by no means universal yet.
(In IE for instance, it is a setting that must be turned on in "Internet 

(*) This is a little-known fact, but there is a Klingon relay station 
on Earth connected to our Internet, and the Klingons in their spaceships 
use it from time to time to access Wikipedia and have a good laugh. 
Their keyboards and browsers are different from ours, of course.

(**) And I bet someone is going to reply here and say that this 
cannot possibly be a valid UTF-8 sequence.
