tomcat-users mailing list archives

From Konstantin Kolinko <knst.koli...@gmail.com>
Subject Re: Character set issue
Date Mon, 05 Dec 2011 01:02:43 GMT
2011/12/5 André Warnier <aw@ice-sa.com>:
> Hi.
>
> I need help with a problem on a Tomcat system.  The system is difficult to
> access, and I cannot reach it directly right now (this is Sunday night in
> Europe).
> I know that the system runs Tomcat 6.something, under Oracle/Sun Java 1.6,
> and that's all I can say right now. The platform is RedHat RHEL, current
> version.
>
> The problem which happens is that, after the update of a webapp (of which I
> do not have the code), it seems that non-US-English "diacritic" characters
> posted to the webapp from a web <form>, are now "corrupted". And I would
> like to understand better the Tomcat mechanism for reading HTTP request form
> parameters, so that I can start to figure out what is going wrong.
>
> The webapp consists of a single servlet, wrapped by two filters.
> The application's web.xml defines the order as
> filter1
> filter2
> servlet
> with both filters processing all requests to the servlet.
>
> "filter1" is a commercial product used on many Tomcat sites.
> "filter2" is my own filter (and it is the only part of which I have the
> source code)
> "servlet" is also a commercial product of which I do not have the code, and
> the one which has just been updated.
>
> What I would like to know is: with a setup such as the above, how does
> Tomcat determine in which /character set/ the body of the POST will be read?
>
> For example :
> Suppose that we have 2 html forms, form1 and form2.  Both forms are
> functionally identical, and contain a text input box named "name1".
> The form form1 has an html declaration which specifies it as having the
> charset "iso-8859-1".
> The form form2 has an html declaration which specifies it as having the
> charset "UTF-8".
>
> The user, in the input box "name1" of each form, types the string "TÜV"
> (second character = uppercase U with umlaut) and then posts the form to the
> webapp.
> The user browser is the same in all cases.
>
> If the servlet executes a request.getParameter("name1"), what are the
> factors which can determine how it receives the value of this parameter ?
>
> Or maybe my question should be : /can/ the servlet (or one of the filters)
> do anything that would cause the value of "name1" to /not/ be a correct Java
> "TÜV" string in the servlet ?
>
> Additional information :
> Only the servlet was updated.  Prior to that update, the application worked
> correctly. So I strongly suspect that it is the updated servlet which
> creates the problem.  But I'd like to understand /how/ it can create such a
> problem, and if for example something in filter1 or filter2 could contribute
> to the problem, or not.
> Filter1 is an authentication servlet filter, and as far as I know it only
> checks HTTP headers, and does not concern itself with the body of the
> request.  But I suppose that even the request body "passes through" this
> filter, and that it could presumably corrupt this body (although I would
> consider this unlikely right now).
> Filter2 is my own filter (and I am not a Java expert).  This filter works at
> a number of installations (and also here, before this servlet update).  It
> subclasses the HTTP request, because it needs to add a HTTP header to the
> request, on-the-fly.  But the subclass only overrides the methods which have
> to do with the HTTP headers, and does not handle the body directly.
>
> Any information or ideas welcome.
>
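
To make the failure mode concrete, here is a small stand-alone sketch in
plain Java (no servlet API involved, so it is not Tomcat's actual code).
It assumes that the browser percent-encodes the field using the charset of
the page, and that the server decodes the body with whatever request
character encoding is in effect. As the FAQ explains, that is ISO-8859-1 by
default unless something calls request.setCharacterEncoding() before the
parameters are first read. The class and method names are made up for
illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class: simulates a form field travelling from browser to server.
class FormCharsetDemo {

    // Roughly what a browser does: percent-encode the value in the page's charset.
    static String browserEncode(String value, String pageCharset) {
        try {
            return URLEncoder.encode(value, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + pageCharset, e);
        }
    }

    // Roughly what the server does: decode the body in the request's charset.
    static String serverDecode(String encoded, String requestCharset) {
        try {
            return URLDecoder.decode(encoded, requestCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + requestCharset, e);
        }
    }

    // Show non-ASCII and control characters as hex escapes so the console encoding does not matter.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x20 || c > 0x7E) {
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String typed = "T\u00DCV"; // T, U with umlaut, V
        String[] charsets = {"ISO-8859-1", "UTF-8"};
        for (String page : charsets) {
            String body = browserEncode(typed, page);
            for (String server : charsets) {
                String received = serverDecode(body, server);
                System.out.println("page=" + page + " server=" + server
                        + " body=" + body
                        + " received=" + escape(received)
                        + " intact=" + received.equals(typed));
            }
        }
    }
}
```

The value survives only when both sides use the same charset. In
particular, the UTF-8 body "T%C3%9CV" decoded as ISO-8859-1 becomes a
four-character string, which is the typical "corrupted diacritics" symptom.
So one thing worth checking is whether anything in the chain reads a
parameter before the encoding is set, since the first read fixes the
decoding.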

1. I think you know the FAQ:
http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

2. Make sure that the web browser understands what character encoding
the web form uses.

Some browsers remember which encoding was used on the previous page and
use that instead of the one declared by the server.
Mixing ISO-8859-1 and UTF-8 forms on the same site is risky for this reason.

Make sure that the content type and charset value in
 a) the Content-Type HTTP header sent by the server and
 b) the META tag in the HTML text
are _literally_ the same. If both are present and they do
not match, odd things may happen in "non-compliant" browsers.

3. A servlet or JSP page invoked via an "include" cannot change the content
type (and thus the charset). Its <%@page contentType=".."%> directive
will be ignored.
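
Regarding point 2, one rough way to check a page is to capture the
Content-Type response header and the content attribute of the META tag
(for example with a browser's network inspector) and compare the charset
they declare. The sketch below is plain Java with made-up names; the regular
expression is a simplification rather than a full header parser, and the
comparison is deliberately case-sensitive to match the "literally the same"
advice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: compares the charset declared in two Content-Type values.
class CharsetDeclarationCheck {

    // Matches charset=VALUE, with optional quotes around VALUE (simplified).
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)\"?", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset exactly as written, or null if none is declared.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // True only if both values declare a charset and the two spellings are identical.
    static boolean declarationsMatch(String headerValue, String metaContent) {
        String fromHeader = charsetOf(headerValue);
        String fromMeta = charsetOf(metaContent);
        return fromHeader != null && fromHeader.equals(fromMeta);
    }

    public static void main(String[] args) {
        String header = "text/html; charset=UTF-8";      // as sent by the server
        String metaSame = "text/html; charset=UTF-8";    // META content attribute
        String metaOther = "text/html; charset=ISO-8859-1";
        System.out.println(declarationsMatch(header, metaSame));
        System.out.println(declarationsMatch(header, metaOther));
    }
}
```

A value with no charset at all (just "text/html") is reported as a
mismatch here, on the assumption that you want both declarations to be
explicit.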


Best regards,
Konstantin Kolinko
