forrest-dev mailing list archives

From Thorsten Scherler <thors...@apache.org>
Subject Re: why is forrest xml output not utf-8
Date Wed, 01 Feb 2006 00:05:50 GMT
On Tue, 31-01-2006 at 20:39 +0000, Ross Gardler wrote:
> David Crossley wrote:
> > I don't know much about encodings, but why are the documents output
> > by our xml serializer as ISO-8859-1 rather than UTF-8?
> 
> I'm confused, I thought we were outputting UTF-8, certainly recent
> messages on the user list say we are.
> 

Forrest uses UTF-8 for the *HTML* serializer in both skins and the
dispatcher.

main/webapp/sitemap.xmap
...
<map:serializer name="html" mime-type="text/html"
src="org.apache.cocoon.serialization.HTMLSerializer">
 <doctype-public>-//W3C//DTD HTML 4.01 Transitional//EN</doctype-public>
 <doctype-system>http://www.w3.org/TR/html4/loose.dtd</doctype-system>
 <encoding>UTF-8</encoding>
</map:serializer>

dispatcher/internal.xmap
...
<map:serializer logger="sitemap.serializer.xhtml" mime-type="text/html" 
 name="xhtml" pool-grow="2" pool-max="64" pool-min="2" 
 src="org.apache.cocoon.serialization.XMLSerializer">
 <doctype-public> -//W3C//DTD XHTML 1.0 Strict//EN </doctype-public>
 <doctype-system> http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd 
  </doctype-system>
 <encoding>UTF-8</encoding>
</map:serializer>

David did not ask about this format (html, which is what the threads on
the user list are about) but about *xml*.

In main/webapp/sitemap.xmap you find:
<map:serializer name="xml" mime-type="text/xml"
src="org.apache.cocoon.serialization.XMLSerializer"/>

Unlike the other examples I gave, this one does not set any encoding.
> > http://cocoon.apache.org/2.1/userdocs/xml-serializer.html
states:
"The XML Serializer accepts following configuration parameters. These
configurations are not Xalan specific.

Name - Xalan Default Value
...
encoding - none
..."

Looking at http://localhost:8888/index.xml I find
<?xml version="1.0" encoding="UTF-8"?>

We state in our FAQ for the question: Does Forrest handle accents for
non-English languages?
"This is because sources for Forrest docs are XML documents, which can
include any of these, provided the encoding declared by the XML doc
matches the actual encoding used in the file."
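
The condition in that FAQ answer (declared encoding must match the actual
encoding) can be checked mechanically. Below is a minimal sketch in Python,
not anything shipped with Forrest or Cocoon; the helper names
declared_encoding and matches_declaration are made up for illustration. It
assumes an ASCII-compatible encoding, so UTF-16 input is out of scope, and it
falls back to UTF-8 when there is no XML declaration.

```python
import re

# Hypothetical helpers for illustration; not part of Forrest or Cocoon.
# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document.
_DECL = re.compile(rb'^<\?xml[^>]*\bencoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def declared_encoding(data: bytes, default: str = "UTF-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    if data.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        data = data[3:]
    match = _DECL.match(data)
    return match.group(1).decode("ascii") if match else default


def matches_declaration(data: bytes) -> bool:
    """True if the bytes can be decoded with the encoding they declare."""
    try:
        data.decode(declared_encoding(data))
    except (UnicodeDecodeError, LookupError):
        return False
    return True
```

Note that this only catches one direction of the mismatch: every byte
sequence decodes as ISO-8859-1, so a UTF-8 file wrongly labelled ISO-8859-1
passes the check, while an ISO-8859-1 file with accents wrongly labelled
UTF-8 normally fails it.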

David, why do you think we would use ISO-8859-1 for xml?

Is it because of:
<map:serializer name="links"
src="org.apache.cocoon.serialization.LinkSerializer">
 <encoding>ISO-8859-1</encoding>
</map:serializer>

http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
gives the answer for this:
"...HTML, on the other hand, allows the entire range of the ISO-8859-1
(ISO-Latin) character set to be used in documents..."

A grep on forrest-trunk brings up some hits for ISO-8859-1. Some of
them, like the i18n stuff, are needed for German, French, Spanish, ...
The cap.xml is the only file that declares that it needs ISO-8859-1.
Some xsl files use ISO-8859-1 as well.
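
For anyone who wants to repeat that search without grep, here is a rough
Python equivalent. It is only a sketch: the function name files_mentioning
is hypothetical, the root directory is whatever your checkout path is, and
skipping .svn and .git directories is my assumption about what you would
want to ignore.

```python
import os


def files_mentioning(root: str, needle: bytes = b"ISO-8859-1") -> list:
    """Return sorted paths under root whose raw bytes contain needle.

    The comparison is case-insensitive, so "iso-8859-1" is found too.
    """
    wanted = needle.lower()
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Assumption: version-control metadata is not interesting here.
        dirnames[:] = [d for d in dirnames if d not in {".svn", ".git"}]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    if wanted in handle.read().lower():
                        hits.append(path)
            except OSError:
                continue  # unreadable file, skip it
    return sorted(hits)
```

Calling files_mentioning on the root of a checkout returns the list of
files to look at; reading the files as bytes avoids having to guess each
file's encoding first.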

Anyway, IMO the answer to the subject of this thread is that Forrest
*is* using UTF-8 for xml documents that use this encoding.

regards
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)

