www-infrastructure-issues mailing list archives

From Konstantin Preißer (JIRA) <j...@apache.org>
Subject [jira] [Created] (INFRA-6974) "Content-Type" header includes "charset=UTF-8" for some static file types
Date Thu, 07 Nov 2013 19:19:17 GMT
Konstantin Preißer created INFRA-6974:

             Summary: "Content-Type" header includes "charset=UTF-8" for some static file types
                 Key: INFRA-6974
                 URL: https://issues.apache.org/jira/browse/INFRA-6974
             Project: Infrastructure
          Issue Type: Bug
          Components: HTTP Server
            Reporter: Konstantin Preißer
            Priority: Minor


I noticed that the HTTP server which serves Apache project websites such as tomcat.apache.org
automatically includes a "charset=UTF-8" parameter in the Content-Type header for static *.html
and *.txt files, regardless of the actual encoding of the file (reference messages:
[1] and [2]).

E.g., if you request http://tomcat.apache.org/ (static html page), then the Content-Type header
will be:

Content-Type: text/html; charset=utf-8
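
To check which charset a given Content-Type value declares, the header can be parsed with Python's standard library. This is only a minimal sketch; the helper name declared_charset is made up for illustration, and the header values are passed in as plain strings rather than fetched from the server:

```python
from email.message import Message


def declared_charset(content_type_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


print(declared_charset("text/html; charset=utf-8"))
print(declared_charset("text/html"))
```

The first call reports the charset the server announced; the second shows what a header without the parameter looks like, which leaves the decision to the declaration inside the file.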

I'm a fan of using UTF-8 for everything (especially for Web pages), and including a "charset"
parameter in the Content-Type header probably saves the browser some time, as it doesn't need to
determine the encoding from the file content. However, not all .html pages on Apache project
websites are encoded as UTF-8, so some pages end up with conflicting encoding declarations.

E.g., for this page:
the encoding in the Content-Type header says "UTF-8", but the encoding declared in
the file content says "ISO-8859-1", which is the actual encoding of the file.
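
Such a conflict could be detected mechanically by comparing the header's charset with the one declared in a <meta> element. The following is a rough sketch under simplifying assumptions: the function names are hypothetical, the regular expression only looks at the first 1024 bytes, and it is not the full encoding-detection algorithm that browsers implement:

```python
import re
from email.message import Message

# Matches <meta charset="..."> as well as the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE
)


def header_charset(content_type_value):
    """Charset parameter of the HTTP Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    return msg.get_content_charset()


def meta_charset(body):
    """Charset declared in a <meta> element near the start of the body, or None."""
    match = META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def charsets_conflict(content_type_value, body):
    """True if both declarations exist and name different charsets."""
    from_header = header_charset(content_type_value)
    from_meta = meta_charset(body)
    return (
        from_header is not None
        and from_meta is not None
        and from_header != from_meta
    )


page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=ISO-8859-1"></head></html>'
)
print(charsets_conflict("text/html; charset=utf-8", page))
```

The comparison is a plain string match, so aliases of the same encoding (for example "latin1" and "iso-8859-1") would be reported as a conflict; a more careful check would normalize the names first.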

As the encoding from the HTTP Content-Type header takes precedence, browsers will interpret
the file as UTF-8 instead of ISO-8859-1. This can mean that if the file contains non-ASCII
characters (> 0x7F), a browser will display them incorrectly because of the wrong encoding.
For the Tomcat 6.0 docs (linked above) this has no visible effect since they don't use non-ASCII
characters directly but encode them as entity references or character references.

However, there are some pages where the conflicting encodings have effects, mostly such that
decoding as UTF-8 fails:
1) http://commons.apache.org/proper/commons-dbcp/
2) http://commons.apache.org/proper/commons-attributes/

In the left-hand menu of page 1), there is an <h5> element with the text "Commons DBCP", but the
space is actually a U+00A0 character (nbsp), encoded as the single byte 0xA0. A lone 0xA0 byte is
not a valid UTF-8 sequence, so browsers fail to decode it as UTF-8 and display "�" (U+FFFD,
Replacement Character) instead. If you manually change the encoding to ISO-8859-1 in the browser's
menu, the page is displayed correctly.
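
The effect can be reproduced outside a browser. A small sketch, assuming the menu entry is stored as the bytes below (the byte string is reconstructed from the description above, not copied from the page):

```python
# "Commons DBCP" with the space stored as the single byte 0xA0,
# which is NO-BREAK SPACE (U+00A0) in ISO-8859-1.
raw = b"Commons\xa0DBCP"

# Decoding with the file's real encoding recovers the nbsp.
as_latin1 = raw.decode("iso-8859-1")

# Decoding as UTF-8, with invalid bytes replaced the way browsers do,
# yields U+FFFD instead.
as_utf8 = raw.decode("utf-8", errors="replace")

# A strict UTF-8 decode rejects the byte outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(as_latin1 == "Commons\u00a0DBCP")  # True
print(as_utf8 == "Commons\ufffdDBCP")    # True
print(strict_ok)                         # False
```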

Additionally, there are some Apache sites with conflicting encodings (encoded as ISO-8859-1,
but the header overrides it with UTF-8), which, however, don't seem to show visible effects:
1) http://jclouds.apache.org/
2) http://jmeter.apache.org/
3) http://perl.apache.org/
4) http://spamassassin.apache.org/
5) http://uima.apache.org/

So, I think that a "charset=UTF-8" parameter shouldn't be appended to Content-Type headers
of static resources if one isn't sure that the encoding is really UTF-8, as there are still
a number of static HTML pages which use ISO-8859-1 instead of UTF-8.
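
One way to be "sure" in the sense above would be to announce UTF-8 only when the file's bytes actually decode as UTF-8. This is a hypothetical heuristic for illustration, not a description of how the Apache servers are or should be configured; the function name is made up:

```python
def content_type_for_html(body):
    """Pick a Content-Type value for static HTML bytes.

    The charset parameter is added only if the bytes are valid UTF-8;
    otherwise it is omitted so the declaration inside the file applies.
    """
    try:
        body.decode("utf-8")  # strict decoding: raises on invalid bytes
    except UnicodeDecodeError:
        return "text/html"
    return "text/html; charset=utf-8"


print(content_type_for_html(b"Commons\xa0DBCP"))
print(content_type_for_html("Preißer".encode("utf-8")))
```

Pure-ASCII files pass the check even if they declare ISO-8859-1, which is harmless because ASCII bytes decode identically in both encodings.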

[1] http://markmail.org/message/ls473qxwtrcegyyo
[2] http://markmail.org/message/oe6re3xtkkwi24py

This message was sent by Atlassian JIRA
