tomcat-users mailing list archives

From André Warnier
Subject Re: [slightly OT] FORM based authentication and utf-8 encoding of credentials
Date Wed, 26 Jun 2013 15:40:35 GMT
Shanti Suresh wrote:
> Hi Chris,
> This is such an interesting discussion.  I am not sure what to make of this
> person's comment:
> -------------------
> TAXI   2012-10-09 09:03:59 PDT
> Wow, no fix since 8 years...
> And this is a real bug: If the HTTP header says the file is encoded in
> ISO-8859-1 the common way to override this with HTML is:
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
> Firefox reads the body in UTF-8 then, which is fine, but the charset
> used in forms is still ISO-8859-1, so you have to add
> accept-charset="utf-8" to the form just for firefox (other browser
> automatically use UTF-8 or send the charset with the content-type).
> So: Why the hell is nobody fixing this bug?
> ---------------
> So the questions I have are:
> (1) Firefox is not properly sending UTF-8 in the POST request even if it
> reads the HTML page in UTF-8?  And other browsers are now sending
> "charset=utf-8" based on the HTML META tag?
> (2) Firefox has started respecting the accept-charset="utf-8" attribute in
> forms now such that it adds charset to the Content-Type header of the POST
> request?   I'm confused.  I thought Mozilla was not going to fix  this
> issue.
> Thanks for any clarifications.

I think that you are still confused.. :-)
(As are, in part, some of the people who posted on that Mozilla bug).

(1) browsers, in general, are *not* sending a "charset" attribute in their POST 
submissions (whether form-url-encoded or multipart).
This is a real pity, because it is the source of much confusion, and the real reason why 
servers have to go through loops to figure out (or force) the character set/encoding of 
the data that they are getting from browser POSTs.
And the Mozilla people seem to say that it is that way because, when they tried to add
this "charset" attribute, it broke a number of server applications at the time (8 years
ago), and they see no reason to think that it would not still be the same today, so they
are not trying it again.
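
As an illustration of what the server is left with, here is a minimal sketch in Python
(chosen only for brevity; this is not Tomcat code). The function name declared_charset
and the sample header values are made up for the example; the iso-8859-1 fallback is the
default discussed in this thread.

```python
from email.message import Message


def declared_charset(content_type, fallback="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, or the fallback."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter,
    # or the fallback when the header carries no charset at all.
    return msg.get_content_charset(fallback)


# What browsers typically send with a form POST: no charset parameter.
typical_browser_post = "application/x-www-form-urlencoded"
# Hypothetical header, if a browser did label its data.
labelled_post = "application/x-www-form-urlencoded; charset=UTF-8"

print(declared_charset(typical_browser_post))  # falls back to iso-8859-1
print(declared_charset(labelled_post))         # utf-8
```

With the first header the server learns nothing and has to guess or force an encoding,
which is the situation described above.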

(1a) what browsers *will* do, in general, is to send POST data in the same character 
set/encoding as the one of the HTML *page* which contains the form being posted.
But, even when sending UTF-8 encoded data according to this principle, they are *not*
indicating that it is UTF-8 data, which is basically wrong, because the standard HTTP/HTML
character set is iso-8859-1, and they *should* indicate it when that is not what they are
sending.  But that is the reality.
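
A small sketch of why the missing label matters, again in Python and only as an
illustration (the sample value and variable names are assumptions, not taken from any
real request). The same form value produces different bytes depending on the page
encoding, and a server that assumes iso-8859-1 misreads UTF-8 data:

```python
from urllib.parse import quote_plus, unquote_plus

# A form value containing one non-ASCII character.
username = "André"

# What ends up in the POST body when the page is UTF-8 vs iso-8859-1.
sent_from_utf8_page = quote_plus(username, encoding="utf-8")
sent_from_latin1_page = quote_plus(username, encoding="iso-8859-1")

# A server assuming the iso-8859-1 default decodes UTF-8 bytes incorrectly.
misread = unquote_plus(sent_from_utf8_page, encoding="iso-8859-1")
# Decoding with the encoding the browser actually used recovers the value.
correct = unquote_plus(sent_from_utf8_page, encoding="utf-8")

print(sent_from_utf8_page)    # Andr%C3%A9
print(sent_from_latin1_page)  # Andr%E9
print(misread)                # AndrÃ©
print(correct)                # André
```

Nothing in the request body itself tells the server which of the two decodings is the
right one.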

(2) the "accept-charset" attribute of a <form> does not mean that this <form> will *send*
data according to that charset/encoding.  It indicates that any data that is entered in
the form's input boxes will be interpreted as being in that charset.
So adding an "accept-charset" attribute to your <form> tags does not make it so that the
browser will magically change its behaviour when POSTing data.

In other words, it's a mess, and the mess is mainly due to some lack of precision in the
original RFCs, but it is being perpetuated now by the fear of browser developers of
breaking server applications by doing things right.
Which is rather funny in a way, considering all the things that browser developers do all
the time anyway which do break existing applications.

We really need an RFC for HTTP 2.0, with UTF-8 as the default charset/encoding.
