tomcat-users mailing list archives

From Konstantin Kolinko <knst.koli...@gmail.com>
Subject Re: Sanity Check
Date Fri, 18 Nov 2016 19:10:39 GMT
2016-11-18 19:02 GMT+03:00 Christopher Schultz <chris@christopherschultz.net>:
> André,
>
> On 11/18/16 3:50 AM, André Warnier (tomcat) wrote:
>> On 18.11.2016 05:56, Christopher Schultz wrote:
>>> Since UTF-8 is supposed to be the "official" character encoding,
>>
>> Now where is that specified ?  As far as I know, the default
>> charset for everything HTTP and HTML-wise is still iso-8859-1, no ?
>> (and unfortunately so).
>
> I apologize for the sloppy language: this particular vendor's service
> claims that UTF-8 is the standard *for their service*. Not for HTTP in
> general.
>
>>> The vendor has responded with (paraphrasing) "it seems we don't
>>> completely follow this standard; we're considering what to do
>>> next, which may include no change". This is a big vendor with
>>> *lots* of software clients, so maintaining backward compatibility
>>> is going to be a big deal for them. I've got some tricks up my
>>> sleeve if they decide not to change anything. Hooray for specs.
>>> :(
>>
>> What I never understood in all that, is why browsers and other
>> clients never seem to respect (and servers do not seem to enforce)
>> what is indicated here :
>>
>> https://www.ietf.org/rfc/rfc2388.txt 4.5 Charset of text in form
>> data
>>
>> This would be a simple way to get rid of umpteen character
>> set/encoding issues encountered when trying to interpret <form>
>> data POSTed to web applications.
>
> The problem is that application/x-www-form-urlencoded doesn't give a
> client a natural way to specify the character encoding, and a/xwfu can
> be used inside of a multipart/form-data package as well. You've just
> moved the problem from the Content-Type of the request to the
> Content-Type of the *part* of the multi-part request. Nothing has been
> solved by using multipart/form-data.
>
> And browsers certainly DO use that, but almost exclusively for things
> like file-upload, since files tend to be very big already, and
> urlencoding a bunch of binary bytes makes the file size increase quite
> a bit.
>
>> It seems to me contrary to common sense that in our day and age,
>> the rules for this could not be set once and for all to something
>> like :
>>
>> 1) the default character set/encoding of HTTP and HTML is
>> Unicode/UTF-8 (instead of the current really archaic iso-8859-1)
>> 2) URLs (including query-strings) should be by default interpreted as
>> Unicode/UTF-8, encoded as per
>> https://tools.ietf.org/html/rfc3986#section-2
>> 3) for POST requests :
>> - for the Content-type "application/x-www-form-urlencoded",
>> there SHOULD be a charset attribute indicating the charset and
>> encoding. By default, this is "text/plain; charset=UTF-8"
>
> Don't forget, charset == encoding. The text/plain is the MIME type,
> and that's already been defined as application/x-www-form-urlencoded.
> Somewhere it should just explicitly say "a/xwfu" must contain only
> ASCII bytes, and always encodes a text blob in UTF-8 encoding.
>
> But it will never happen (see below).

One more authority that I forgot to mention in my mail:
the IANA registry of MIME types

Registry:
https://www.iana.org/assignments/media-types/media-types.xhtml

Registration entry for "application/x-www-form-urlencoded"
https://www.iana.org/assignments/media-types/application/x-www-form-urlencoded

-> Encoding considerations : 7bit

According to the RFC defining this registry, it means that the data is
7-bit ASCII only.
https://tools.ietf.org/html/rfc6838#section-4.8

-> Required parameters : No parameters
-> Optional parameters :  No parameters

OK. So no charset= parameter is allowed.
My advice to specify the charset parameter was wrong.
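
To illustrate the gap that "7bit" plus "no parameters" leaves, here is a
minimal Java sketch (the class and method names are mine, for
illustration only). It shows that the percent-encoded body is pure ASCII
whatever charset the client used, so the body alone does not tell the
server how to turn the escaped bytes back into characters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SevenBitFormDemo {

    // Percent-encode a value; the bytes behind each %XX escape depend on the charset.
    static String encode(String value, String charset) throws UnsupportedEncodingException {
        return URLEncoder.encode(value, charset);
    }

    // True if every character of the string fits in 7-bit US-ASCII.
    static boolean isAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    public static void main(String[] args) throws Exception {
        String value = "caf\u00e9"; // "cafe" with an acute accent on the e

        String utf8 = encode(value, "UTF-8");        // caf%C3%A9
        String latin1 = encode(value, "ISO-8859-1"); // caf%E9

        // Both encoded forms are 7-bit clean, yet they differ.
        System.out.println(utf8 + " ascii=" + isAscii(utf8));
        System.out.println(latin1 + " ascii=" + isAscii(latin1));

        // Decoding with the wrong charset silently yields two wrong characters.
        String wrong = URLDecoder.decode(utf8, "ISO-8859-1");
        System.out.println(wrong.equals(value)); // false
    }
}
```

Both outputs satisfy the "7bit" requirement, which is why the encoding
considerations in the registration do not settle the charset question.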

Historically, though, about 10 years ago I did see
"application/x-www-form-urlencoded;charset=UTF-8" as a Content-Type in
the wild.

It was a web site authored in WML (Wireless Markup Language) and
accessed via the WAP protocol by mobile phones.

(Specification reference for this WML/WAP usage:
http://technical.openmobilealliance.org/Technical/release_program/docs/Browsing/V2_3-20070227-C/WAP-191-WML-20000219-a.pdf

Document title:
WAP WML
WAP-191-WML
19 February 2000

Wireless Application Protocol
Wireless Markup Language Specification
Version 1.3

-> Page 30 of 110 (in Section "9.5.1 The Go Element"):
There is a table, where the following line is relevant:

Method: post
Enctype: application/x-www-form-urlencoded
Process: [...] The Content-Type header must include the charset
parameter to indicate the character encoding.

I suspect that the above URL is not the official location of the
document. I found it through Googling.
The official location should be http://www.wapforum.org/what/technical.htm
)


Apache Tomcat supports the use of the charset parameter with Content-Type
application/x-www-form-urlencoded in POST requests.
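
As a rough sketch of what honoring that parameter involves (this is not
Tomcat's actual code; the class, the helper names, and the ISO-8859-1
fallback are my assumptions for illustration), plain Java can pull the
charset out of the Content-Type value and use it to decode the body:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class FormCharsetSketch {

    // Return the charset parameter of a Content-Type value, or the fallback if absent.
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (!value.isEmpty()) {
                    return value;
                }
            }
        }
        return fallback;
    }

    // Split an application/x-www-form-urlencoded body and decode it with the given charset.
    static Map<String, String> parseForm(String body, String charset)
            throws UnsupportedEncodingException {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(name, charset), URLDecoder.decode(value, charset));
        }
        return params;
    }

    public static void main(String[] args) throws Exception {
        String header = "application/x-www-form-urlencoded;charset=UTF-8";
        String charset = charsetOf(header, "ISO-8859-1");
        Map<String, String> form = parseForm("name=caf%C3%A9&lang=fr", charset);
        System.out.println(charset + " " + form.get("lang"));
    }
}
```

Inside a web application one would normally not parse the header by hand:
ServletRequest.getCharacterEncoding() and getParameter() are the usual
entry points.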

>> - for the Content-type "multipart/form-data", each "part" MUST have
>> a Content-type header.  If this Content-type is a "text" type, then
>> the Content-type header SHOULD contain a charset attribute. If
>> omitted, by default this is "charset=UTF-8".
>>
>> and be done with it once and for all.
>
> Right: once and for all, for new clients who implement the spec. All
> old clients, servers, proxies, etc. be damned. It's just not
> possible due to the need to be backward-compatible with really weird
> stuff like "smart" toasters and refrigerators, WebTV (remember that?)
> and all manner of embedded devices that will never be updated.
>
> What we really need is a new header that says "here's everything you
> need to know about encoding for this request" and clients and servers
> who both support that header can use it. All other uses need to
> fall back to this old and nasty heuristic.
>
> - -chris


Best regards,
Konstantin Kolinko


