tomcat-users mailing list archives

From André Warnier
Subject Re: Character set issue
Date Mon, 05 Dec 2011 22:13:42 GMT
Marvin Addison wrote:
>> /can/ the servlet (or one of the filters)
>> do anything that would cause the value of "name1" to /not/ be a correct Java
>> "TÜV" string in the servlet ?
> Yes, absolutely.  If this is a posted value and some filter fires that
> coerces the encoding (e.g. request.getParameter() in the case of POST)
> of the request, all subsequent filters and the servlet will see the
> string in the encoding of the first filter.  This is why it's
> important to set the encoding as early in the servlet processing
> pipeline as possible.

Thank you for the answer.

> For your particular case it's hard to imagine an encoding in practice
> that would make that string appear incorrectly.  Both iso-8859-1 and
> utf-8 should handle Ü correctly.

I don't think that's true.  A "Ü" in iso-8859-1 is a single byte (\xDC).  In Unicode/UTF-8
encoding, it is 2 bytes (\xC3 \x9C).  (The Unicode codepoint of "Ü" is 00DC (hex), but that's
a different matter.)

So if the servlet reads a parameter from the post, thinking the post is UTF-8 while it is
really iso-8859-1, and this parameter is a "Ü", the servlet will read 2 bytes, getting
\xDC and whichever byte follows it, and get garbage, because \xDC followed by any other
byte is probably not valid UTF-8.
On the other hand, if the servlet reads a parameter from the post, thinking the post is
iso-8859-1 while it is really UTF-8, and this parameter is a "Ü", the servlet will read a
single byte (\xC3), which will be converted to the Java Unicode character with codepoint
00C3 (hex), which is a capital A tilde (can't even type that on my German keyboard).
The following byte (\x9C) is then converted separately, into a second, unrelated character.
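
To make the two mismatch cases above concrete, here is a minimal sketch using only the JDK's charset support. The class name UmlautBytes and the helper hex() are made up for this illustration; nothing here is Tomcat code.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: class and method names are hypothetical.
final class UmlautBytes {
    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "TÜV", written with an escape so the source file encoding does not matter.
        String s = "T\u00DCV";

        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(latin1)); // 54 DC 56
        System.out.println(hex(utf8));   // 54 C3 9C 56

        // Case 1: iso-8859-1 bytes decoded as UTF-8. \xDC is not followed by a
        // continuation byte, so the decoder substitutes the replacement character.
        String asUtf8 = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(asUtf8.contains("\uFFFD")); // true

        // Case 2: UTF-8 bytes decoded as iso-8859-1. Every byte becomes one char,
        // so the two bytes of "Ü" turn into U+00C3 followed by U+009C.
        String asLatin1 = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(asLatin1.equals("T\u00C3\u009CV")); // true
    }
}
```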

In fact, this is what happens in reality :

We have a html page, defined as being content-type="text/html; charset=UTF-8".
It is saved as UTF-8, by a Unicode-savvy editor.
It is received by the browser, and the browser (IE or Firefox) says that the document is 
The page contains a <form> tag, which contains an enctype="UTF-8" attribute.
The form contains an input text box, in which the user types a "Ü" and then submits the form.

In the normal configuration of the target webapp, there are
(in that order).
servlet reads the post parameters and the servlet gets garbage instead of the Java string

If we remove filter1 and filter2, leaving servlet alone, then servlet reads the proper "Ü".

If we reinstate filter1 and filter2, and in filter2 (the only piece whose code I
control), I add an early call to
then servlet gets the correct string.

Who is "responsible" for setting the request character set ? In my naive understanding, I
thought that whenever a method call happens which requires parsing the request body, and
if by that time the request encoding has not been set explicitly, it would be Tomcat code
which would evaluate the circumstances and set the encoding appropriately.
Such as :
- default is iso-8859-1 (as per HTTP default)
- but if the request somehow says otherwise (*), then whatever the request says.
   ((*) which for a POST it should always do, no ?)
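
The rule I have in mind could be sketched as follows. This is only my assumption of how such a resolution might work, not what Tomcat actually does; RequestCharset and resolve() are hypothetical names, and the argument is the raw value of the request's Content-Type header.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustration only: a hypothetical helper, not part of the servlet API or Tomcat.
final class RequestCharset {
    // Returns the charset named in a Content-Type header value,
    // or iso-8859-1 when the header names none (or names an unknown one).
    static Charset resolve(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: fall through to the default.
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(resolve("application/x-www-form-urlencoded; charset=UTF-8")); // UTF-8
        System.out.println(resolve("application/x-www-form-urlencoded"));                // ISO-8859-1
    }
}
```

If the header carries no charset parameter at all, this rule simply ends up at the default.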

Is that a wrong understanding ?
(I read the Servlet Spec v 3.0, section 3.10, but I am still not sure)

filter2 contains calls, in that order, to
- config.getInitParameter
- optionally, for testing : request.setCharacterEncoding("UTF-8")
- request.getRequestURL
- request.getQueryString
- request.getRemoteAddr
- request.getHeaderNames
- request.getHeader
- request.getAttributeNames
.. and, finally, a
- request.getParameter

Is it then the responsibility of filter2 to set the request encoding ?
Should the optional request.setCharacterEncoding become mandatory ?
Should the request.setCharacterEncoding call be made just before the request.getParameter,
or is there another earlier method call in the list above that can trigger the encoding to
be already set ?
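
To illustrate why the order of these calls matters, here is a toy model of a request whose parameters are parsed once, on the first getParameter call, with whatever encoding is in effect at that moment. ToyRequest is a made-up stand-in, not the servlet API and not Tomcat's implementation. It only mimics the documented contract of ServletRequest.setCharacterEncoding, which has no effect once the parameters have been read.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a request object; hypothetical, for illustration only.
final class ToyRequest {
    private final String rawBody;        // urlencoded body as sent on the wire
    private String encoding;             // null until someone sets it
    private Map<String, String> params;  // parsed lazily, exactly once

    ToyRequest(String rawBody) {
        this.rawBody = rawBody;
    }

    // Ignored once the parameters have been parsed.
    void setCharacterEncoding(String enc) {
        if (params == null) {
            encoding = enc;
        }
    }

    String getParameter(String name) {
        if (params == null) {
            // Assumed default when nothing was set explicitly.
            String enc = (encoding != null) ? encoding : "ISO-8859-1";
            params = new HashMap<>();
            try {
                for (String pair : rawBody.split("&")) {
                    int eq = pair.indexOf('=');
                    if (eq < 0) continue;
                    params.put(URLDecoder.decode(pair.substring(0, eq), enc),
                               URLDecoder.decode(pair.substring(eq + 1), enc));
                }
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException("unsupported encoding: " + enc, e);
            }
        }
        return params.get(name);
    }

    public static void main(String[] args) {
        String body = "name1=T%C3%9CV"; // "TÜV" as a browser would send it in UTF-8

        ToyRequest early = new ToyRequest(body);
        early.setCharacterEncoding("UTF-8");
        System.out.println(early.getParameter("name1").equals("T\u00DCV")); // true

        ToyRequest late = new ToyRequest(body);
        late.getParameter("name1");          // an earlier reader triggers parsing
        late.setCharacterEncoding("UTF-8");  // too late, ignored
        System.out.println(late.getParameter("name1").equals("T\u00C3\u009CV")); // true
    }
}
```

In this model only the first parameter read fixes the encoding; the other calls in the list above do not touch the body.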
