tomcat-users mailing list archives

From André Warnier
Subject Re: [slightly OT] FORM based authentication and utf-8 encoding of credentials
Date Tue, 02 Jul 2013 15:26:04 GMT
Shanti Suresh wrote:
> Greetings,
> On Wed, Jun 26, 2013 at 4:08 PM, Christopher Schultz wrote:
>> André,
>>> But, even when sending UTF-8 encoded data according to this
>>> principle, they are *not* indicating that it is UTF-8 data, which
>>> is basically wrong, because the standard HTTP/HTML character set is
>>> iso-8859-1, and they *should* indicate it when that is not what
>>> they are sending.  But that is the reality.
>> No, as much as it pains me to do so, I agree with the Mozilla folks
>> on this one: adding a charset attribute to an
>> application/x-www-form-urlencoded Content-Type violates the spec. There is
>> no good solution.
>> ...
>>> We really need an RFC for HTTP 2.0, with UTF-8 as the default
>>> charset/encoding.
>> +1
>> Maybe they can clear up Tomcat logging configuration while they are at
>> it :)
> Thank you!  This discussion was quite informational.

You are welcome.

As a further, relatively [OT] note: in some other (non-Tomcat, non-Java) applications, we
solve the general issue as follows (taking into account the deficiencies of the RFCs, the
servers, the browsers, and the users):
- when programmers create the HTML documents containing the forms, they must make sure
that they use a tool which really saves the HTML document in the charset/encoding that
they intend
- the HTML page should contain a declaration like:
<meta http-equiv="Content-Type" content="text/html; charset=xxxxx" />
(where xxxxx is the correct charset/encoding, like "UTF-8")
- each form that is sent to the browser is sent by the server with an explicit HTTP header:
Content-Type: text/html; charset=xxxxx
(that normally happens automatically, but you should nevertheless check that it matches)
- the <form> tag of the form should contain the "accept-charset" attribute with the
expected character set as value, like
<form accept-charset="UTF-8" ...>
- the form itself contains a hidden parameter like:
<input type="hidden" name="charset-test" value="yyyyy">
(where yyyyy is a character sequence chosen so that, seen as bytes, its length differs
depending on the character set actually used. E.g., for Europe, "ÖöÜüÄä")
- the application which receives the form submit data must first check whether the string
received for the "charset-test" parameter matches its expectations.
In other words, if the application expects UTF-8 and the test string is "ÖöÜüÄä", then it
should check that the received string has a byte length of 12 (two bytes per character in
UTF-8) and a character length of 6, and matches the expected Unicode string.
And if it doesn't, then it should take appropriate action (abort the action, or try
another charset)
(if the form sent by the server contains additional data coming from a back-end database
system, then one should of course also check that the charset of that data matches that
of the form).
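
For illustration only, the server-side check described above could be sketched roughly as
follows, here in Python. The function name detect_form_charset and the fallback list are
hypothetical and not taken from any of the applications mentioned; the sketch assumes the
application can still get at the raw, percent-encoded value of the "charset-test"
parameter before anything has decoded it.

```python
from urllib.parse import unquote_to_bytes

# The test string from the hidden "charset-test" field ("ÖöÜüÄä"), written with
# escapes so that the encoding of this source file cannot interfere.
SENTINEL = "\u00d6\u00f6\u00dc\u00fc\u00c4\u00e4"


def detect_form_charset(raw_value, expected="utf-8", fallbacks=("iso-8859-1",)):
    """Return the first charset under which raw_value decodes to SENTINEL, else None.

    raw_value is the percent-encoded parameter value exactly as submitted.
    The expected charset is tried first, then each fallback in order.
    """
    # In form-urlencoded data "+" stands for a space; undo that before unescaping.
    raw = unquote_to_bytes(raw_value.replace("+", " "))
    for charset in (expected, *fallbacks):
        try:
            if raw.decode(charset) == SENTINEL:
                return charset
        except UnicodeDecodeError:
            # These bytes are not valid in this charset; try the next one.
            continue
    return None


# 6 characters, 12 bytes as UTF-8, but only 6 bytes as ISO-8859-1.
print(len(SENTINEL), len(SENTINEL.encode("utf-8")), len(SENTINEL.encode("iso-8859-1")))
print(detect_form_charset("%C3%96%C3%B6%C3%9C%C3%BC%C3%84%C3%A4"))
print(detect_form_charset("%D6%F6%DC%FC%C4%E4"))
```

A return value of None corresponds to the "abort the action" case; a value other than the
expected charset tells the application which charset to use for decoding the remaining
parameters of that submission.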

This may look a bit like overkill, but it is the result of long and painful real-world
experience with multi-lingual applications used with multiple browsers and multiple types
of users in multiple countries doing cut-and-paste of all kinds of stuff into forms.
