tomcat-users mailing list archives

From André Warnier
Subject Re: Tomcat 5 and UTF-8
Date Fri, 03 Apr 2009 22:43:31 GMT

One of my preferred subjects...

1) As per the HTTP specs, the server should send a Content-Type header
along with any response to a browser.  If the response is of the general
type "text", then this Content-Type header should also contain a charset
parameter, indicating the character set and the encoding.
If it is not indicated, the charset defaults to iso-8859-1 (which is
both a charset and an 8-bit encoding).
Apache and Tomcat normally do that, but a badly-written application can
override that and screw things up.  There are also cases where Apache
and Tomcat genuinely do not know the encoding, as when picking up a file
from disk, and have to pick either the default iso-8859-1 or whatever
their configuration specifies as a default.
Of course that guess is sometimes wrong.
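
To illustrate point 1, here is a minimal Python sketch of how a client
could take the charset from a Content-Type header and fall back to
iso-8859-1 when none is given.  The function names are invented for this
example, and it borrows the standard library's e-mail header parser
rather than a real HTTP stack, so treat it as an illustration only.

```python
from email.message import Message

# Default for "text" responses when the header carries no charset.
HTTP_DEFAULT_CHARSET = "iso-8859-1"


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when there is no charset parameter.
    return msg.get_content_charset() or HTTP_DEFAULT_CHARSET


def decode_body(body, header_value):
    """Decode response bytes the way the header says they are encoded."""
    return body.decode(charset_from_content_type(header_value))
```

With this sketch, the UTF-8 bytes of "André" sent under a bare
"text/html" header decode to "AndrÃ©", which is exactly the kind of
damage a missing or wrong charset produces.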

2) also per the HTTP specs, when the server sends a Content-Type header, 
the client (browser) should not second-guess the server. It should 
accept and respect the header in order to interpret the content.
Major discrepancy: all versions of IE that I know of second-guess the
server, in clear violation of the HTTP specs.  They make their own
inspection and heuristic determination of the content received, and
unfortunately they get it wrong in a number of cases.  Since IE still
accounts for over 90% of the browsers used in corporate environments,
the poor webapp programmer is forced to take this bad behaviour into
account.

3) If the server sends back a document prefixed by a BOM, then IE also
automatically interprets the document as being Unicode, no matter what
the server (or the document) says.  This is stupid because a UTF-8
encoded document does not need a BOM: UTF-8 is a byte-oriented
encoding anyway, with no possibility of getting a byte order wrong.
Windows Notepad saves all Unicode documents with a BOM, even when saving
them as UTF-8.
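
For point 3, a short Python sketch of what the UTF-8 BOM looks like at
the byte level and how a reader can tolerate it.  The helper names are
invented for this example; the codecs used are the standard "utf-8" and
"utf-8-sig" ones.

```python
import codecs


def has_utf8_bom(data):
    """True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_utf8_ignoring_bom(data):
    """Decode UTF-8, dropping one leading BOM if there is one."""
    # "utf-8-sig" strips a leading BOM and otherwise behaves like "utf-8".
    return data.decode("utf-8-sig")


# What a Notepad-style "UTF-8" save looks like, versus plain UTF-8.
notepad_style = codecs.BOM_UTF8 + "André".encode("utf-8")
plain = "André".encode("utf-8")
```

Decoding notepad_style with plain "utf-8" does not fail, but it leaves
an invisible U+FEFF character at the start of the text, which is why
the BOM tends to leak into pages and templates.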

4) The HTML specs are distinct from the HTTP specs.  In the HTML specs,
there exists a <meta http-equiv="Content-Type" ..> tag, which supposedly
can contain a charset indication about the content of this HTML page.
I personally find this rather clumsy, because the client has to start
reading and decoding the HTML document before it can read and interpret
this tag, so its real practical significance is doubtful.  It also
seems to be superfluous and confusing considering (1) and (2) above.
(Like, what if (1) and (4) specify different charsets/encodings?)
But ok, it might be of some use for HTML editors, which could use this
to try to interpret correctly a document loaded from disk, in which case
there is no Content-Type sent by a server.
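
To make the chicken-and-egg problem of point 4 concrete, here is a rough
Python sketch of a client scanning the raw bytes for a meta charset
before it knows the encoding.  The regular expression, the 1024-byte
limit and the function names are assumptions made for this example, not
what any particular browser does.  The precedence shown (HTTP header
first, then meta, then the default) follows my reading of HTML 4.01.

```python
import re

# Only the first bytes are inspected; 1024 is an arbitrary choice here.
PRESCAN_BYTES = 1024
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_meta_charset(raw):
    """Look for a charset inside a <meta> tag, working on undecoded bytes."""
    match = META_CHARSET.search(raw[:PRESCAN_BYTES])
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(header_charset, raw, default="iso-8859-1"):
    """Prefer the HTTP header, then the meta tag, then the default."""
    return header_charset or sniff_meta_charset(raw) or default
```

The scan only works because the markup around the charset name is plain
ASCII in most encodings; for an encoding that is not ASCII-compatible
(UTF-16, say) this approach finds nothing.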

5) The HTTP specs as well as the HTML specs are still neither entirely
precise nor unambiguous about some aspects of the general character set
issues, for example when a POST request contains data encoded as
"URL-encoded" (application/x-www-form-urlencoded).  Also, even modern
browsers (including Firefox 3) do not properly specify the encoding of
multi-part POSTs.
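
The URL-encoded ambiguity of point 5 can be shown in a few lines of
Python.  The percent-encoded bytes say nothing about which charset
produced them, so the receiver has to guess.  The helper names are
invented for this example; quote_plus and parse_qs are the standard
urllib.parse functions.

```python
from urllib.parse import parse_qs, quote_plus


def form_encode_value(text, charset):
    """Percent-encode one form value using the given charset."""
    return quote_plus(text, encoding=charset)


def form_decode(body, charset):
    """Parse a URL-encoded body, assuming the given charset."""
    return parse_qs(body, encoding=charset, errors="replace")


as_latin1 = form_encode_value("André", "iso-8859-1")  # "Andr%E9"
as_utf8 = form_encode_value("André", "utf-8")          # "Andr%C3%A9"
```

Decoding "name=Andr%C3%A9" as iso-8859-1 yields "AndrÃ©", and decoding
"name=Andr%E9" as UTF-8 yields a replacement character in place of the
"é".  Neither request body tells the server which reading is right.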

6) Encoding rules are different for the URLs, for the HTTP headers, and
for the content.  Even a URL has two distinct types of encoding: the
part for the hostname (Punycode, RFC 3492), and the part for the path
and query-string (charset unspecified, percent-encoding).
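
As a small illustration of point 6, the Python sketch below encodes a
hostname and a path with the two different mechanisms.  The function
names and the sample hostname are made up for this example; note that
Python's built-in "idna" codec implements the older IDNA 2003 rules.

```python
from urllib.parse import quote


def encode_hostname(hostname):
    """Encode a hostname label by label as Punycode with the xn-- prefix."""
    return hostname.encode("idna").decode("ascii")


def encode_path(path, charset="utf-8"):
    """Percent-encode a path; the charset is the caller's choice."""
    return quote(path, safe="/", encoding=charset)
```

For instance, encode_hostname("bücher.example") gives
"xn--bcher-kva.example", while encode_path("/café") gives "/caf%C3%A9"
with UTF-8 and "/caf%E9" with iso-8859-1.  The hostname has one defined
form; the path depends on a charset that the URL itself never states.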

7) It never ceases to amaze me how much productive time is lost every
year to character set issues on the web, when Unicode/UTF-8 has been
around for several years as a charset/encoding covering all languages
known to man and beyond.  Why hasn't a proposal for HTTP 2.x / HTML 5.x
come about, reconciling those aspects and establishing Unicode/UTF-8 as
the default (or only) encoding, for URLs as well as content?

8) What is also missing, in my view, is some more general proposal
covering the format of text files (and text streams), anywhere.  To
remove any ambiguity, each text file/stream should contain at least a
short prefix indicating its MIME type and its charset/encoding.

All the above is why I keep on seeing my name echoed back to me as
AndrÃ©, even on some well-known supposedly international websites.
