cocoon-users mailing list archives

From Marc Portier
Subject Re: Encoding problems, still!
Date Sun, 31 Oct 2004 17:16:14 GMT

Tuomo L wrote:
>> taking one step at a time (what am I not seeing?):
>> - suppose a sax stream (producing xhtml) before serialization has a 
>> @href holding a eurosign (\u20AC unicode char)
>> - I hear you guys saying that xalan will recognize the uri-type 
>> attribute and serialize this character out as %E2%82%AC regardless of 
>> the chosen output encoding (didn't catch it but I am assuming that the 
>> output-encoding is set to UTF-8 anyways, and matches the form-encoding 
>> setting)
>> - so we get an html page out telling the browser it is utf-8 encoded
>> - so the browser will apply utf-8 encoding to form-values (and names) 
>> if this were about a form, but it's about this ready @href
>> - now this @href already has this same encoding (thx xalan) in place: 
>> so things should work the same as for the form (as long as the 
>> mentioned eurosign is strictly in the parameter-values)
>> So assuming all this reasoning is ok, what could never work is this:
>> - change your form-encoding (and matching setting of serialization) to 
>> anything else than UTF-8, cos then request-params in forms and 
>> pre-built ones in url's get encoded differently and we have no way to 
>> make a distinction over at cocoon's side
> You're right.

thx for confirming
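
To make the mismatch from the quoted reasoning visible, here is a small sketch. It is only an illustration: the class name EncodeCompare and the sample values are made up, and it uses the plain java.net.URLEncoder from the JDK, not anything from xalan or cocoon.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration only, not part of cocoon or xalan.
class EncodeCompare {
    // Percent-encode a parameter value using the given charset name.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String euro = "\u20AC";          // the eurosign
        String umlauts = "\u00E4\u00F6"; // the two umlauts from the example link
        System.out.println(encode(euro, "UTF-8"));         // %E2%82%AC
        System.out.println(encode(umlauts, "UTF-8"));      // %C3%A4%C3%B6
        System.out.println(encode(umlauts, "ISO-8859-1")); // %E4%F6
    }
}
```

The same value ends up as different bytes on the wire depending on the charset, which is why a form-encoding other than UTF-8 cannot be told apart from the pre-built links at the receiving side.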

>> It's sad news for Tuomo, but I can't see why it wouldn't be just 
>> working if (and only if)
>> - this is about parameter-values and NOT about URL's or 
>> parameter-names (because there we *need* to do some work)
> Yes, I was talking about parameter values all the time, but didn't show 
> it clearly enough in the example. It should be:
> <a href="someurl?foo=äö" foo="äö">äö</a>

ok, that makes things clear

> Where the foo's value gets UTF-8 encoded by Xalan during serialization, 
> no matter what the settings are anywhere else.
>> - container-encoding is traditionally set to ISO-8859-1 (unless using 
>> a container like jetty where you can modify its internal behaviour)
> Mine is set to ISO-8859-1.

good, keep it like that

>> - form-encoding is strictly kept to 'utf-8' (thx for the lesson) and 
>> the serializer follows that (meta-equiv and all)
> These don't help either, since the UTF-8 encoded parameter values are 
> read in as ISO-8859-1 and the output is invalid. If these parameter 

now this I don't understand

they are indeed read in using ISO-8859-1, but then inside cocoon they 
get re-encoded and decoded again:
1. yourUtf8UrlEncodedValue --> first url-decoded and then interpreted by 
the container using ISO-8859-1
2. this result is re-encoded to bytes by cocoon using 'container-encoding'
3. the bytes coming out of that should equal the utf-8 bytes of the 
parameter-value, i.e. the bytes the url-encoding was applied to
4. so decoding these with 'form-encoding' (==UTF-8) should really just work
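
The four steps above can be sketched roughly like this. It is a hypothetical illustration, not cocoon's actual 'decode' method: the class and method names are invented, and both encodings are hard-coded to the values discussed in this thread (container-encoding ISO-8859-1, form-encoding UTF-8).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; cocoon's real code is not reproduced here.
class RoundTripSketch {
    // Step 1: what the container does with the raw query value.
    static String containerDecode(String urlEncoded) {
        try {
            return URLDecoder.decode(urlEncoded, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // ISO-8859-1 is always available
        }
    }

    // Steps 2-4: back to bytes with container-encoding,
    // then decode those bytes with form-encoding.
    static String recode(String fromContainer) {
        byte[] raw = fromContainer.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = containerDecode("%E2%82%AC");
        System.out.println(garbled.length());       // 3 (U+00E2 U+0082 U+00AC)
        String fixed = recode(garbled);
        System.out.println(fixed.length());         // 1
        System.out.println(fixed.equals("\u20AC")); // true
    }
}
```

This only works because ISO-8859-1 maps every byte to exactly one char and back, so step 2 loses nothing.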

> values are now put for example in database, there are several '?'-marks 
> where those special characters should appear.

well, as a general remark you have to be careful with both

1. databases --> they typically have an encoding set too, and you should 
consult the settings of your jdbc driver to make sure you're not having 
a mismatch there

2. interpreting question-marks: I remember spending oodles of time 
looking at something that had been working all along, just because the 
tool I used to read the logfiles or sql-output did not support the 
encoding or was using a font that had no glyph for a certain character 
(then you see these question-marks while in fact all is well)

anyways: safest thing to do is to step through the code in a debugger (at 
the level of the 'decode' method mentioned earlier) or to insert java code 
that prints the length of the string or, even better, compares or dumps 
the int values of all its chars as held in JVM memory

best to take it one step at a time...
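
A minimal sketch of such a dump helper, assuming nothing beyond the JDK (the class name CharDump is made up). It prints code units instead of glyphs, so it does not depend on the console, log viewer or font being able to show the character.

```java
// Hypothetical debugging helper, not part of cocoon.
class CharDump {
    // Returns the string length followed by the hex value of every char.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        sb.append("length=").append(s.length()).append(':');
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format(" U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A correctly decoded eurosign is a single char.
        System.out.println(dump("\u20AC"));             // length=1: U+20AC
        // The same utf-8 bytes misread as ISO-8859-1 show up as three chars.
        System.out.println(dump("\u00E2\u0082\u00AC")); // length=3: U+00E2 U+0082 U+00AC
    }
}
```

Calling dump() on the parameter value right after it is read, and again right before it goes to the database, should show at which step the chars change.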

> Maybe I just have to send the parameters within a form (as Joerg had 
> done it), which is not very practical when you only need to do a 
> simple HTTP-GET with parameters. Or else I use an XSL-stylesheet which 

I agree that's not practical

and as argued above it doesn't make sense that this would help: the form 
will encode the values in exactly the same way (i.e. first utf-8, then 
url-encode) as xalan prepared things... so things should really just work IMHO

> converts all the special characters in parameter values to ISO-8859-1 
> before Xalan serialization. This works, but is also impractical, since I 
> have to write a long xsl:choose-section. Doing it this way also 
> decreases the performance of my application.
> Can we come up with a better solution?
> Thank you guys for taking interest in this issue.

I'd like to just understand this first, and then, if we need to, fix it 
for sure...

-marc= (off for 5 days helas, I hope you guys find a nice way out - and 
let us know)

Marc Portier                  
Outerthought - Open Source, Java & XML Competence Support Center
