forrest-dev mailing list archives

From David Crossley <cross...@apache.org>
Subject Re: why is forrest xml output not utf-8
Date Wed, 01 Feb 2006 00:46:57 GMT
Thorsten Scherler wrote:
> Ross Gardler wrote:
> > David Crossley wrote:
> > > I don't know much about encodings, but why are the documents output
> > > by our xml serializer as ISO-8859-1 rather than UTF-8?
> > 
> > I'm confused, I thought we were outputting UTF-8; certainly recent
> > messages on the user list say we are.
> 
> Forrest uses UTF-8 for the *HTML* serializer in both the skins and
> the dispatcher.
> 
> main/webapp/sitemap.xmap
> ...
> <map:serializer name="html" mime-type="text/html"
> src="org.apache.cocoon.serialization.HTMLSerializer">
>  <doctype-public>-//W3C//DTD HTML 4.01 Transitional//EN</doctype-public>
>  <doctype-system>http://www.w3.org/TR/html4/loose.dtd</doctype-system>
>  <encoding>UTF-8</encoding>
> </map:serializer>
> 
> dispatcher/internal.xmap
> ...
> <map:serializer logger="sitemap.serializer.xhtml" mime-type="text/html" 
>  name="xhtml" pool-grow="2" pool-max="64" pool-min="2" 
>  src="org.apache.cocoon.serialization.XMLSerializer">
>  <doctype-public> -//W3C//DTD XHTML 1.0 Strict//EN </doctype-public>
>  <doctype-system> http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd 
>   </doctype-system>
>  <encoding>UTF-8</encoding>
> </map:serializer>
> 
> David did not ask about this format (html, which is what the user
> threads are about) but about *xml*.

Yeah, our answers to Ross crossed in mid-air.

> in main/webapp/sitemap.xmap you find:
> <map:serializer name="xml" mime-type="text/xml"
> src="org.apache.cocoon.serialization.XMLSerializer"/>
> 
> Unlike the other examples I gave, this one does not set any encoding.
> 
> http://cocoon.apache.org/2.1/userdocs/xml-serializer.html
> states:
> "The XML Serializer accepts following configuration parameters. These
> configurations are not Xalan specific.
> 
> Name - Xalan Default Value
> ...
> encoding - none
> ..."
> 
> Looking at http://localhost:8888/index.xml I find
> <?xml version="1.0" encoding="UTF-8"?>
> 
> We state in our FAQ for the question: Does Forrest handle accents for
> non-English languages?
> "This is because sources for Forrest docs are XML documents, which can
> include any of these, provided the encoding declared by the XML doc
> matches the actual encoding used in the file."
> 
> David, why do you think we would use ISO-8859-1 for xml?

I am presuming that we just forgot to set the UTF-8
encoding parameter.
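
A minimal sketch of that fix, assuming the xml serializer accepts the
same <encoding> child element that the html serializer declaration
quoted above uses (untested against main/webapp/sitemap.xmap):

```xml
<!-- Assumption: mirrors the <encoding> setting of the "html" serializer -->
<map:serializer name="xml" mime-type="text/xml"
 src="org.apache.cocoon.serialization.XMLSerializer">
 <encoding>UTF-8</encoding>
</map:serializer>
```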

> Is it because of:
> <map:serializer name="links"
> src="org.apache.cocoon.serialization.LinkSerializer">
>  <encoding>ISO-8859-1</encoding>
> </map:serializer>

That was going to be my next question. That should
be UTF-8 too.
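
As a sketch, assuming nothing else depends on the current value, the
links serializer declaration would then read:

```xml
<!-- Assumption: only the encoding value changes from ISO-8859-1 to UTF-8 -->
<map:serializer name="links"
 src="org.apache.cocoon.serialization.LinkSerializer">
 <encoding>UTF-8</encoding>
</map:serializer>
```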

> http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
> gives the answer for this:
> "...HTML, on the other hand, allows the entire range of the ISO-8859-1
> (ISO-Latin) character set to be used in documents..."
> 
> A grep on forrest-trunk turns up some hits for ISO-8859-1. Some of
> them, like the i18n stuff, are needed for German, French, Spanish, ...

Are you sure?

> cap.xml is the only file that declares that it needs ISO-8859-1.

I reckon that is an accident from the original
author's text editor. The rest of our xml source
docs should be UTF-8. I asked this once long ago
on cocoon-dev and the answer was emphatic to use
UTF-8 across the board.

> Some xsl files use ISO-8859-1 as well.
> 
> Anyway, IMO the answer to the subject of this thread is that forrest
> *is* using UTF-8 for xml documents that use this encoding.

I think that we actually have inconsistency problems.

-David
