From Konstantin Preißer <>
Subject Website: "Content-Type" header includes "charset=UTF-8" for some static file types
Date Mon, 28 Oct 2013 22:45:47 GMT
Hi all,

I noticed that the HTTP Server which serves (and also other Apache sites)
automatically includes a "charset=UTF-8" field in the Content-Type header for static *.html
files and for *.txt files, regardless of the actual encoding of the file.
E.g., if you request (static html page), then the Content-Type
header will be:

Content-Type: text/html; charset=utf-8

Now, although I'm a fan of using UTF-8 for everything (especially for Web pages), and including
a "charset" field in the Content-Type probably saves the browser some time as it doesn't need
to find out the encoding from the file content, this means that some .html pages have conflicting
encoding declarations, as not all .html pages on the Tomcat Website are encoded as UTF-8.

E.g., for this page:
the encoding in the Content-Type header says "UTF-8", but the encoding declared in
the file content says "ISO-8859-1", which is the actual encoding of the file.
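
To make the conflict concrete, here is a rough Python sketch (not something the site uses) that
compares the charset from a Content-Type header value with the one declared in a <meta> element.
The function names and the sample page bytes are made up for illustration, and the regex only
looks at the first 1024 bytes, which is a simplification of what browsers really do.

```python
import codecs
import re
from email.message import Message


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value (lowercased), or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(html_bytes):
    """Return the charset declared in a <meta> element near the top of the file, or None."""
    head = html_bytes[:1024].decode("ascii", errors="replace")
    match = re.search(
        r'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', head, re.IGNORECASE
    )
    return match.group(1).lower() if match else None


def _canonical(name):
    """Map aliases such as 'latin-1' and 'iso-8859-1' to one codec name."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return name


def declarations_conflict(content_type, html_bytes):
    """True if both the header and the file declare a charset and they differ."""
    from_header = header_charset(content_type)
    from_meta = meta_charset(html_bytes)
    if from_header is None or from_meta is None:
        return False
    return _canonical(from_header) != _canonical(from_meta)


# Hypothetical page content resembling the older docs.
sample_page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=ISO-8859-1"></head><body></body></html>'
)
print(declarations_conflict("text/html; charset=utf-8", sample_page))
```

With the header value shown earlier and a page that declares ISO-8859-1, the check reports a
conflict; without a charset parameter in the header there is nothing to conflict with.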

As the encoding from the HTTP Content-Type header takes precedence, browsers will interpret
the file as UTF-8 instead of ISO-8859-1. This can mean that if the file contains non-ASCII
characters (> 0x7F), a browser will display them incorrectly because of the wrong encoding.
This affects mostly the Docs of older Tomcat versions (3, 4, 5, 6, 7) as they are in ISO-8859-1,
whereas Tomcat 8's docs are in UTF-8. (Though as far as I have seen, none of these .html pages
use non-ASCII characters directly but encode them as entity references or character references,
so for them this issue does not have practical consequences.)
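
Whether a given file is affected at all could be checked with something like the following
Python sketch, which lists the offsets of bytes above 0x7F. The helper name is hypothetical;
a file that returns an empty list is pure ASCII and decodes identically as ISO-8859-1 and UTF-8.

```python
def non_ascii_offsets(data):
    """Return the offsets of all bytes greater than 0x7F in the given bytes object."""
    return [offset for offset, byte in enumerate(data) if byte > 0x7F]


# Entity references keep the file pure ASCII, so nothing is reported.
print(non_ascii_offsets(b"caf&eacute;"))
# A raw 0xA0 byte (nbsp in ISO-8859-1) is reported at its offset.
print(non_ascii_offsets(b"Commons\xa0DBCP"))
```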

While for Tomcat 6 and 7 the XSLT probably can be changed to output as UTF-8, I don't know
if something like this should be done for docs of unsupported versions like 3.x etc.

This is an example of a site where the conflicting encodings cause problems:

In the LHS menu, there is an <h5> element with text "Commons DBCP", but the space is
actually a 0xA0 character (nbsp). As this is a non-ASCII character, browsers will fail to
decode it when using UTF-8, so they display "�" (U+FFFD, Replacement Character) instead.
If you manually change the encoding to ISO-8859-1 in the browser's menu, the page will be
displayed correctly.
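
The effect can be reproduced outside a browser. This is a small Python sketch that assumes
the menu text is stored as ISO-8859-1 bytes; the variable names are only for illustration.

```python
# "Commons DBCP" with a no-break space, as it would be stored in an ISO-8859-1 file.
raw = "Commons\u00a0DBCP".encode("iso-8859-1")

# Decoding as UTF-8 (what the Content-Type header asks for): the lone 0xA0 byte
# is not a valid UTF-8 sequence, so it becomes U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding as ISO-8859-1 (the actual encoding): the no-break space survives.
as_latin1 = raw.decode("iso-8859-1")

print(ascii(as_utf8))    # 'Commons\ufffdDBCP'
print(ascii(as_latin1))  # 'Commons\xa0DBCP'
```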

It seems that this issue has existed for some time now, as with r1182745, the output
encoding of the Tomcat Site's XSLT was changed to UTF-8 by Konstantin Kolinko, with the
commit message:
"Change output encoding, so that <META> header added by XSTL processor matches with
HTTP Content-Type header added by site."

Does anybody know the reasoning behind adding a "charset=UTF-8" field in the Content-Type
for every *.html page? Should an issue be raised for this at Apache Infra?


Konstantin Preißer

