From Thorsten Scherler <>
Subject Re: Latin1 character problems in dispatcher
Date Fri, 21 May 2010 09:03:18 GMT

On 20/05/2010, at 18:41, Sjur Moshagen wrote:

> Den 20. mai. 2010 kl. 15.26 skrev Thorsten Scherler:
>> On 20/05/2010, at 14:18, Sjur Moshagen wrote:
>>>> ...
>>>> Hmm, that is weird. Please try the following:
>>>> - add a new contract that uses ñ, í and similar characters
>>>> - see what comes out
>>> I added a blank contract that just printed the same line of characters I used earlier for testing, and this is what came out:
>>> This is a text containing problematic characters:
>>> a á c č d đ n ŋ s š t ŧ z ž ae æ oe ø ao å a¨ ä o¨ ö g ǥ h ħ u ʉ i ɨ
>>> That is, the text from the contract comes through just fine, but text coming from a standard Forrest v2 document gets garbled.
>>> I have attached a picture of the page as it renders. The box comes from the document, the text at the bottom is from the contract.
>> Ok I see. 
>> Please post the dataUri you use for the contract. It seems that the utf-8 is lost in this step. If you have the dataUri of the contract see what is coming out there, whether it is already scrambled or not.
> I'm not sure about how to do this, but I'll try. The dataUri used in the structurer is:
>          <forrest:contract name="content-main" 
>            dataURI="cocoon://#{$getRequest}.body.xml">   <-- this is the dataURI
>            <forrest:property name="content-main-conf">
>              <headings type="boxed"/>
>            </forrest:property>
>          </forrest:contract>
> which I take to mean:
> http://localhost:8888/index.body.xml

Correct, that was the URI I needed.

> The text returned by that Uri is:
> <?xml version="1.0" encoding="ISO-8859-1"?><div id="content"><h1>Divvun - Sámi proofing tools project</h1><div id="content-main">
> 	  <div class="note"><div class="label">UTF-8 character test</div><div
> 		There seems to be problems with certain characters, but only in
> 		Dispatcher:<br xmlns:xi=""/>
> 		a á c &#269; d &#273; n &#331; s &#353; t &#359; z &#382; ae æ oe ø ao å a¨ ä o¨ ö g &#485; h &#295; u &#649; i &#616;
> 	  </div></div>
>  </div></div>
> Two things to note here:
> The encoding is specified as ISO-8859-1, which is wrong,

Yes, it should be UTF-8.

> and which leads to all characters outside Latin1 to be encoded as numeric entities.

Actually, the numeric form is fine, or at least it should be. In my use case I take RSS from Roller, and the characters come in numeric form but with UTF-8 encoding.

> In the next step, this causes all non-ASCII, non-Latin1 characters to survive correctly, while the Latin1 chars will be messed up when they are reinterpreted as UTF-8 later - or something along these lines.

Yeah, it seems the numeric form works fine but the "native" form does not play nice.
I wonder whether changing the encoding of the returned *.body.xml document fixes that.
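
To illustrate the suspected effect outside of Forrest, here is a minimal Python sketch. It is not Forrest or Cocoon code: the function name `roundtrip` and the sample string are made up for illustration, and it assumes the later pipeline step reads the bytes as UTF-8 regardless of the XML declaration. It serializes text in a given encoding, writing characters the encoding cannot represent as numeric references, then decodes the bytes as UTF-8 and resolves the references.

```python
import html


def roundtrip(text, serializer_encoding, reader_encoding="utf-8"):
    """Serialize text in one encoding, then read the bytes back in another."""
    # Characters the serializer encoding cannot represent become numeric
    # character references such as &#269; (comparable to the *.body.xml output).
    raw = text.encode(serializer_encoding, errors="xmlcharrefreplace")
    # Assumption: the consumer decodes as UTF-8; undecodable bytes become U+FFFD.
    decoded = raw.decode(reader_encoding, errors="replace")
    # Resolve the numeric references, as an XML parser would.
    return html.unescape(decoded)


sample = "a á c č"
print(roundtrip(sample, "iso-8859-1"))  # the Latin-1 character á is lost, č survives
print(roundtrip(sample, "utf-8"))       # both characters survive
```

With ISO-8859-1 as the serializer encoding, only the characters that were written as numeric references come through intact, which matches what the garbled page shows. With UTF-8 on both sides, nothing is lost.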

> I don't know where the encoding comes from - everything on my end is marked as UTF-8. I grepped for the string "ISO-8859-1" in the Forrest sources, and got many hits, but nothing that seemed to relate to Dispatcher.

The *.body.xml comes from the dataModel.xmap:

      <!-- HTML rendered from intermediate format -->
      <map:match pattern="**.body.xml">
        <map:generate src="cocoon:/{1}.source.rewritten.xml" />
        <map:transform src="{lm:dataModel-html-document-to-html.xsl}">
          <map:parameter name="path" value="{1}.html" />
        </map:transform>
        <!-- no type attribute, so the sitemap's default serializer is used -->
        <map:serialize />
      </map:match>

The serializer here is the default one, which we define in the xmap as:

<map:serializers default="xml" />

That should read:
<map:serializers default="xml-utf8" />

I committed this change in revision 946939; please check whether it fixes the issue. I also added a test note to
org.apache.forrest.plugin.internal.dispatcher/src/documentation/content/xdocs/index.xml so
you can run "forrest run" directly in the plugin and see the outcome.

Once we are done testing, we should remove the debug note.


Thorsten Scherler <>
Open Source Java <consulting, training and solutions>

