tomcat-users mailing list archives

From André Warnier
Subject Re: POST request encoding - Tomcat/JVM configuration?
Date Sat, 24 Oct 2009 12:30:02 GMT
Pfeifer Jan wrote:

> I know about URIEncoding in server.xml and about using Encoding filter, but we use this
> for decoding GET request for historical reasons. Or is there more "correct"
> way to decode String?

This whole area of the character set in which HTTP requests come into a 
server, and are decoded by the server, is complicated, confusing, and 
generally not well-defined (or defined in contradictory ways) by the 
Internet RFCs themselves.
In short, there can be many reasons why you are not getting the data in 
the character set that you expect, and finding the specific reason that 
applies in your case can be tedious and involve several levels.
To resolve it, you have to be very systematic, and check every step one 
by one.
Here are some principles :

1) the general "default" for the HTTP protocol, and for HTML, is 
iso-8859-1.  Anything else, you have to explicitly specify.
iso-8859-1 is at the same time a character set, and an encoding, in 
which each character is represented by one byte.

2) internally, Java represents all character strings as Unicode (which 
is a character set), using 16-bit code units (UTF-16, which is an 
encoding).

(1) and (2) above mean that somewhere, no matter what, some character 
set translation is going to take place, between "the web" and your Java 
webapp, and vice-versa between your webapp and the web.  The trick is to 
get the pieces in place so that the /correct/ translations take place in 
each direction.
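
To make (1) and (2) concrete, here is a minimal Java sketch of that translation step. The class and method names are made up for illustration. It shows that the same two bytes from "the web" turn into different Java strings depending on which charset the decoder is told to use.

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // The two bytes that UTF-8 uses for the single character U+00E9 (e with acute accent).
    static final byte[] WIRE_BYTES = { (byte) 0xC3, (byte) 0xA9 };

    // Correct translation if the sender really used UTF-8: one character.
    static String asUtf8() {
        return new String(WIRE_BYTES, StandardCharsets.UTF_8);
    }

    // Same bytes read as iso-8859-1: every byte becomes one character, so two characters.
    static String asLatin1() {
        return new String(WIRE_BYTES, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println("as UTF-8:      " + asUtf8().length() + " char(s)");
        System.out.println("as iso-8859-1: " + asLatin1().length() + " char(s)");
    }
}
```

Neither call fails. The wrong choice just silently produces the wrong characters, which is why each step has to be checked one by one.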

3) iso-8859-1 (in fact all iso-8859-x character sets and encodings) can 
each represent only 256 different characters, which is not enough to 
cover all languages used on the WWW nowadays. So if your applications 
have to use Czech and German at the same time, you should not use an 
iso-8859 charset.

4) UTF-8 is a popular encoding of Unicode, where each character is 
represented by one or more bytes.
The big advantage of Unicode/UTF-8 is that it can represent all 
characters of all languages used on the WWW.
The drawback of Unicode/UTF-8 at the moment is that, for historical 
reasons, it is /not/ the HTTP/HTML default charset, so you have to 
explicitly specify it in several places.
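
As an illustration of (3) and (4), the following Java sketch (hypothetical class and method names) checks whether a character fits into iso-8859-1 at all, and how many bytes it takes in UTF-8. The sample characters are written as Unicode escapes so that the source file itself does not depend on any encoding.

```java
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if every character of the string exists in iso-8859-1.
    static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String german = "\u00fc"; // u with diaeresis, used in German
        String czech = "\u010d";  // c with caron, used in Czech
        System.out.println("German char fits iso-8859-1: " + fitsLatin1(german));
        System.out.println("Czech char fits iso-8859-1:  " + fitsLatin1(czech));
        System.out.println("UTF-8 bytes for each: " + utf8Length(german) + ", " + utf8Length(czech));
    }
}
```

The German character fits into iso-8859-1 and the Czech one does not, while UTF-8 can encode both (two bytes each here, one byte for plain ASCII).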

5) despite what is said above about the default for HTTP being 
iso-8859-1, URLs are an exception.  A URL, by definition, is not in any 
specific character set or encoding.  The definition of URLs just says 
that, whatever the character set and encoding used, *any* byte whose 
value does not match one of the printable characters of the US-ASCII 
range (roughly [0-9A-Za-z] + some), must be encoded in "%AB" notation, 
where "%AB" is : the "%" sign, followed by a 2-digit hexadecimal 
representation of the byte value.

In other words it means that, when interpreting data that comes as part 
of a URL (like the query string in a HTTP GET),
- the server first decodes the URI from the "%AB" encoding above, back 
into a series of bytes
- then the server further decodes this series of bytes into a string of 
characters, using some charset encoding
- but, the only way to know in which character set the data really is, 
is *by convention* between the client and the server.

The convention, historically so far, has always been iso-8859-1.
Recently and slowly, it seems that this convention is now shifting 
toward UTF-8.
But note that it is a convention still, and that in order to make sure 
that your application (and Tomcat before it) can consider the parameters 
from a GET URL to be UTF-8, /you/ have to make sure that all URLs on 
which a user may click in one of /your/ pages are indeed encoded that 
way.
(And thus basically also, if you receive a request from an unknown 
source, well, you have to guess..)
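
The two decoding steps described above can be sketched in plain Java as follows. This is only an illustration with made-up names, not what Tomcat actually runs, and it does no validation of malformed "%" sequences.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QueryDecodeDemo {
    // Step 1: undo the "%AB" notation. The result is a series of bytes, not yet characters.
    static byte[] percentDecode(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                // Two hex digits follow the "%" sign; a non-hex pair would throw here.
                out.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c); // plain US-ASCII character, kept as one byte
            }
        }
        return out.toByteArray();
    }

    // Step 2: turn the bytes into characters, using the charset agreed "by convention".
    static String decode(String encoded, Charset convention) {
        return new String(percentDecode(encoded), convention);
    }

    public static void main(String[] args) {
        String value = "caf%C3%A9"; // a page that encodes its links as UTF-8 + %AB
        System.out.println(decode(value, StandardCharsets.UTF_8).length());      // 4 characters
        System.out.println(decode(value, StandardCharsets.ISO_8859_1).length()); // 5 characters
    }
}
```

Step 1 gives the same bytes in both calls. Only the charset chosen in step 2 differs, and nothing in the URL itself says which one is right.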

See in Tomcat 6.0 docs, the following attribute of the HTTP Connector :

URIEncoding :	
This specifies the character encoding used to decode the URI bytes, 
after %xx decoding the URL. If not specified, ISO-8859-1 will be used.

(The above applies to GET requests, because in that case the request 
parameters are passed as part of the URI)

Now about POST requests :

In a POST, the request parameters are not sent as part of a query string 
in a URI, but they are sent in the *body* of the request.
There are 2 ways to format a POST request from the client side :
a) as a "url-encoded" body (the default).
b) as a multipart/form-data body.
(That is the case if the <Form> tag contains the attribute :
enctype="multipart/form-data")

In (a), the body consists of one long string, which looks like the query 
string of a GET (name=value pairs separated by "&").
The charset and encoding of that string are supposed to be given by the 
"Content-type" HTTP header of that POST request.

In (b), it is more complicated :
The body of the request is composed of "parts", each part representing 
one parameter.  Each part /should/ have its own Content-type header, 
indicating the type of that part, and if applicable, the character set 
and encoding of that part.

In theory thus, there should never be any confusion about the character 
set and encoding of POST data.
In practice, however, there is a lot, because browsers and servers 
alike do not always respect the above rules strictly.
For example, even modern browsers do not generally indicate a character 
set and encoding for the text parts of (b) above.
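
For case (a) above, the lookup a server has to do can be sketched like this in Java. The class and method names are hypothetical, and falling back to iso-8859-1 when no charset is given is an assumption that follows the "default" described in (1), not a statement about what any particular server does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {
    // Extracts the charset parameter from a Content-type header value,
    // e.g. "application/x-www-form-urlencoded; charset=UTF-8".
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim();
                    // The value may be quoted: charset="UTF-8"
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    // Throws an unchecked exception for an unknown or illegal name.
                    return Charset.forName(name);
                }
            }
        }
        // Assumption: no charset given, so use the historical HTTP default.
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=UTF-8"));
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```

The second call is the common case in practice: the header carries no charset at all, so the server ends up applying a default or a guess.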

See, in Tomcat docs, the following attribute of the HTTP Connector, as 
an example of the confusion :

useBodyEncodingForURI :
This specifies if the encoding specified in contentType should be used 
for URI query parameters, instead of using the URIEncoding. This setting 
is present for compatibility with Tomcat 4.1.x, where the encoding 
specified in the contentType, or explicitly set using 
Request.setCharacterEncoding method was also used for the parameters 
from the URL. The default value is false.
(editor's note: and rightly so)

Strictly speaking thus, the Request.setCharacterEncoding() method 
/should not exist/, because the character set and encoding of request 
data should always be specified by the browser, and the server should 
not guess.
And the "useBodyEncodingForURI" attribute should not exist either, 
because the URI may have a charset encoding, but it has nothing to do 
with the encoding of the request body.

In practice, I have found that the following set of "recipes" 
generally gives predictable results :

1) under Unix/Linux, in the scripts which start Tomcat, make sure that
the process which starts Tomcat is itself started under a UTF-8 locale.
For example, set
LC_ALL="en_US.utf8"; export LC_ALL
(if on your system, "en_US.utf8" is a valid locale. Use "locale -a" to 
find out)
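Under Windows, there is no such "locale" setting available, or I have 
never found it.  There, the JVM normally takes its default charset from 
the system code page (often windows-1252) rather than UTF-8; the 
-Dfile.encoding=UTF-8 JVM option can be used to override that.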
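
To check what the JVM actually picked up from its environment, a small Java program like the following can be run under the same account and startup script as Tomcat. The class name is made up for illustration.

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    // Reports the charset the JVM uses when code does not name one explicitly.
    static String describe() {
        return "default charset=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If this does not report UTF-8, any code that converts between bytes and strings without naming a charset will use something else.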

2) to create your application HTML pages :
- use a UTF-8 aware editor, set for UTF-8 text mode, and save all your 
pages as Unicode/UTF-8.  (Do /not/ use Windows Notepad, because it saves 
all UTF-8 documents with a leading BOM, which is wrong.)
- make sure that all your pages include the following in the HTML <Head> 
part :
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
- make sure that all your <Form> tags include the following attribute :
<Form .... accept-charset="UTF-8">
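
To verify that an editor did not add a leading BOM when saving a page, the first three bytes of the file can be checked. The following Java sketch uses a hypothetical class name and works on a byte array, however the bytes were read.

```java
public class BomCheck {
    // A UTF-8 BOM is the byte sequence EF BB BF at the very start of the data.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'h', 't', 'm', 'l', '>' };
        byte[] withoutBom = { '<', 'h', 't', 'm', 'l', '>' };
        System.out.println(hasUtf8Bom(withBom));    // true
        System.out.println(hasUtf8Bom(withoutBom)); // false
    }
}
```

For a real page, the bytes could come from java.nio.file.Files.readAllBytes(...) on the page's path.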

3) In theory, you should make sure that whenever your server sends a 
html page to a browser, it includes the proper HTTP "Content-type" in 
the response, with the proper charset indication (UTF-8). I don't 
exactly know how one specifies this explicitly in the case of Tomcat. 
But it seems that it does it right all by itself.

4) do /not/ use the above "useBodyEncodingForURI" attribute for the 
Tomcat Connectors.

5) If you do all that, /and/ are sure that all URL links in your html 
pages are correctly encoded in UTF-8 + %AB encoding, then also use the
URIEncoding="UTF-8"
attribute of the Tomcat <Connector> tags.

[OT, but not entirely] :

We definitely need a new HTTP 2.0 RFC, where :
- the URI charset/encoding is Unicode/UTF-8 by default, instead of 
- HTML pages served by servers are UTF-8 by default, instead of iso-8859-1
- browsers using multipart/form-data POST encoding MUST provide a 
"Content-type" (and, if applicable a "charset" attribute) for each part 
of the POST
- servers MUST follow the request indications for Content-type
- browsers MUST follow server response indications for Content-type
(and not like IE, make their own guesses)
