From Sjur Moshagen <>
Subject Re: Latin1 character problems in dispatcher
Date Thu, 20 May 2010 16:41:42 GMT
On 20 May 2010, at 15:26, Thorsten Scherler wrote:

> On 20/05/2010, at 14:18, Sjur Moshagen wrote:
>>> ...
>>> Hmm, that is weird. Please try the following:
>>> - add a new contract that uses ñ, í and similar characters
>>> - see what comes out
>> I added a blank contract that just printed the same line of characters I used earlier for testing, and this is what came out:
>> This is a text containing problematic characters:
>> a á c č d đ n ŋ s š t ŧ z ž ae æ oe ø ao å a¨ ä o¨ ö g ǥ h ħ u ʉ i ɨ
>> That is, the text from the contract comes through just fine, but text coming from a standard Forrest v2 document gets garbled.
>> I have attached a picture of the page as it renders. The box comes from the document, the text at the bottom is from the contract.
> Ok I see. 
> Please post the dataUri you use for the contract. It seems that the UTF-8 encoding is lost in this step. If you have the dataUri of the contract, check what comes out there, and whether it is already scrambled or not.

I'm not sure about how to do this, but I'll try. The dataUri used in the structurer is:

          <forrest:contract name="content-main" 
            dataURI="cocoon://#{$getRequest}.body.xml">   <-- this is the dataURI
            <forrest:property name="content-main-conf">
              <headings type="boxed"/>

which I take to mean:


The text returned by that Uri is:

<?xml version="1.0" encoding="ISO-8859-1"?><div id="content"><h1>Divvun
- Sámi proofing tools project</h1><div id="content-main">

	  <div class="note"><div class="label">UTF-8 character test</div><div
		There seems to be problems with certain characters, but only in
		Dispatcher:<br xmlns:xi=""/>
		a á c &#269; d &#273; n &#331; s &#353; t &#359; z &#382; ae æ
oe ø ao å a¨ ä o¨ ö g &#485; h &#295; u &#649; i &#616;


Two things to note here:

The encoding is declared as ISO-8859-1, which is wrong, and which causes all characters
outside Latin1 to be written as numeric character references. As a result, the non-Latin1
characters survive the next step intact (the references are plain ASCII), while the non-ASCII
Latin1 characters get messed up when their bytes are later reinterpreted as UTF-8, or
something along these lines.
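
To illustrate what I think is happening, here is a minimal Python sketch. It is not Forrest or Cocoon code: the function names are made up for illustration, and it only assumes that one step serializes the text as ISO-8859-1 with character references for everything outside Latin1, and that a later step reads those bytes as UTF-8.

```python
# Hypothetical reproduction of the suspected encoding round trip.
# Assumption: step 1 serializes as ISO-8859-1, step 2 decodes as UTF-8.

def serialize_latin1(text: str) -> bytes:
    """Encode as ISO-8859-1; characters outside Latin1 become &#NNN; references."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


def misread_as_utf8(data: bytes) -> str:
    """Decode the bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return data.decode("utf-8", errors="replace")


sample = "á č æ ŋ"                  # á and æ are Latin1; č and ŋ are not
wire = serialize_latin1(sample)
garbled = misread_as_utf8(wire)

print(wire)     # Latin1 chars are single high bytes, the others are ASCII references
print(garbled)  # the references are intact, the Latin1 chars are replaced
```

The references such as `&#269;` are pure ASCII, so they pass through any later decoding untouched, whereas a lone byte like `0xE1` is not a valid UTF-8 sequence. That would match what I see in the rendered page.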

I don't know where the encoding comes from - everything on my end is marked as UTF-8. I grepped
for the string "ISO-8859-1" in the Forrest sources, and got many hits, but nothing that seemed
to relate to Dispatcher.


