cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Encoding problems
Date Tue, 11 Mar 2003 11:48:19 GMT
The new webapp welcome page contains a copyright character which is not 
encoded as the HTML entity &copy; or as a numeric &#xxx; character 
reference, but is directly copied in the proper encoding.

The offending char is contained in the welcome.xslt stylesheet that is 
encoded as ISO-8859-1.

The pipeline does

  - welcome.xml -> ISO-8859-1
  - welcome.xslt -> ISO-8859-1
  - xhtml serializer -> UTF-8

the results are indeed encoded using UTF-8, thus the copyright sign ends 
up taking 16 bits instead of 8 (UTF-8 is a variable-width encoding: the 
7-bit ASCII range stays at one byte per char, for easy back compatibility 
and compactness since most markup falls in that range, while everything 
else takes two to four bytes. UTF-16 is more even in that respect, but 
it is rarely used on the web because mostly-ASCII text comes out twice 
as big).
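
As a quick check of the sizes involved, here is a minimal Python sketch 
(not part of Cocoon; the helper name encoded_sizes is made up for 
illustration) that encodes the copyright sign in the three encodings 
mentioned above and counts the bytes:

```python
COPYRIGHT = "\u00a9"  # the copyright sign, U+00A9


def encoded_sizes(ch):
    """Return the number of bytes `ch` occupies in each encoding."""
    # utf-16-be is used so that no byte order mark is counted.
    return {name: len(ch.encode(name))
            for name in ("iso-8859-1", "utf-8", "utf-16-be")}


sizes = encoded_sizes(COPYRIGHT)
utf8_bytes = COPYRIGHT.encode("utf-8")  # the two bytes sent on the wire
ascii_sizes = encoded_sizes("a")        # a plain ASCII char for comparison
```

The single ISO-8859-1 byte 0xA9 becomes the pair 0xC2 0xA9 in UTF-8, 
while a plain ASCII letter stays at one byte in UTF-8 and doubles in 
UTF-16.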

On MacOSX, the results are interesting:

  - mozilla 1.3b (20030212) displays the correct encoding
  - safari 1.0b(v60) doesn't
  - camino 0.7 (2003030613) displays the correct encoding
  - IE 5.2.2 (5010.1) doesn't

I traced the problem down to the fact that, apparently, both IE and 
Safari are *NOT* able to understand the encoding declared in the 
starting XML PI (the <?xml ... ?> declaration).
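
To illustrate what goes wrong, here is a small Python sketch. It assumes 
the failing browsers fall back to ISO-8859-1 when they find no encoding 
they understand; that fallback is an assumption for illustration, not 
something verified for IE or Safari. The sample page is made up as well.

```python
# A made-up page that declares UTF-8 only in the XML declaration.
page = '<?xml version="1.0" encoding="UTF-8"?>\n<p>Copyright \u00a9 2003</p>'

# What the serializer sends: the page as UTF-8 bytes.
wire = page.encode("utf-8")

# A user-agent that honors the declaration decodes the bytes as UTF-8.
correct = wire.decode("utf-8")

# A user-agent that ignores it and assumes ISO-8859-1 (hypothetical
# fallback) reads the two bytes 0xC2 0xA9 as two separate characters.
garbled = wire.decode("iso-8859-1")
```

Under that assumption the copyright sign shows up as "Â©", the typical 
symptom of UTF-8 bytes being read as ISO-8859-1.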

On the other hand, by placing

  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

the server creates an HTTP header that instructs the user-agent about 
the encoding. This solved the encoding problem on *all* browsers.

Results:

1) this is *NOT* a cocoon issue
2) be aware of the fact that some user-agents do not parse the XML PI to 
get the encoding, but only the HTTP headers.
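
The difference between the two kinds of user-agent can be sketched as 
follows. This is only a rough Python model under stated assumptions: the 
function names, the read_xml_declaration switch and the ISO-8859-1 
fallback are all hypothetical, and real browsers use more elaborate 
detection rules.

```python
import re


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)', value or "", re.IGNORECASE)
    return match.group(1) if match else None


def charset_from_xml_declaration(body):
    """Return the encoding named in a leading XML declaration, or None."""
    match = re.match(
        rb'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', body)
    return match.group(1).decode("ascii") if match else None


def pick_encoding(content_type, body, read_xml_declaration,
                  fallback="iso-8859-1"):
    """Model a user-agent choosing how to decode a response body.

    The HTTP header wins when it names a charset. The XML declaration is
    consulted only by user-agents that support it. The fallback value is
    an assumption made for this sketch.
    """
    header_charset = charset_from_content_type(content_type)
    if header_charset:
        return header_charset
    if read_xml_declaration:
        declared = charset_from_xml_declaration(body)
        if declared:
            return declared
    return fallback


body = b'<?xml version="1.0" encoding="UTF-8"?><html/>'
reads_declaration = pick_encoding("text/html", body, True)
ignores_declaration = pick_encoding("text/html", body, False)
with_header = pick_encoding("text/html; charset=UTF-8", body, False)
```

In this model, once the charset is present in the HTTP header both kinds 
of user-agent agree on UTF-8, which matches the behaviour observed above.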

NOTES:
1) there is no clear indication in the XHTML specification about how 
user-agents have to guess the encoding
2) there is no indication of what MIME type the XHTML content should have.

These problems reflect the lack of direct collaboration between the IETF 
and W3C on the XML/HTTP relationship. Unfortunately, this is only going 
to get worse. So be prepared, especially for severely internationalized 
content.

Stefano.

