cocoon-users mailing list archives

From Marc Portier <...@outerthought.org>
Subject Re: Encoding problems, still!
Date Sat, 30 Oct 2004 00:42:35 GMT


Tuomo L wrote:
> Ok, now I'm really confused.
> 
> In Bruno's excellent paper about Cocoon encoding, there's a section that 
> says:
> 
> "For Java-insiders: what Cocoon actually does internally is apply the 
> following trick to get a parameter correctly decoded: suppose "value" is 
> a string containing a request parameter, then Cocoon will do:
> 
> value = new String(value.getBytes("ISO-8859-1"), "UTF-8");      "
> 

correct.

this trick is a re-encode followed by a decode:
we get a string from getParameter, encode it back to bytes with ISO-8859-1 
and decode those bytes as UTF-8

why? to correct the container's mistake

the container will have received bytes (let's call these the 
original-request-parameter-bytes) but will have applied its 
'container-encoding' to those in order to return a String from 
getParameter.

(NOTE: this container-encoding is a property of your chosen container and 
is typically fixed to iso-8859-1; unless you are running jetty with the 
mentioned charset property set, you should never change this)

now, cocoon knows from the form-encoding in which encoding forms have 
been serialized out, and thus how request params will be *really* encoded

so to correct the error the container made we encode back to the 
original bytes using latin-1 and then apply the correct form-encoding 
(utf-8)
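
A minimal sketch of that round trip in plain Java, assuming a container-encoding of ISO-8859-1 and a form-encoding of UTF-8. The class and method names are made up for illustration (this is not Cocoon's code), and StandardCharsets needs a newer JDK than the ones in common use today:

```java
import java.nio.charset.StandardCharsets;

public class ReDecodeSketch {

    // Mimics the container: turns the raw request bytes into a String
    // using ISO-8859-1, whatever encoding the browser really used.
    static String containerDecode(byte[] rawBytes) {
        return new String(rawBytes, StandardCharsets.ISO_8859_1);
    }

    // The trick quoted from Bruno's paper: go back to the original bytes
    // via ISO-8859-1, then decode them with the real form-encoding (UTF-8).
    static String fix(String value) {
        return new String(value.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String typed = "\u00e4\u00f6";                      // "äö" as typed in the form
        byte[] raw = typed.getBytes(StandardCharsets.UTF_8); // bytes C3 A4 C3 B6 on the wire

        String wrong = containerDecode(raw);                 // four characters of mojibake
        String fixed = fix(wrong);                           // two characters again

        System.out.println(wrong.length());      // 4
        System.out.println(fixed.equals(typed)); // true
    }
}
```

The round trip is lossless because ISO-8859-1 maps every byte value to exactly one character, so encoding the wrongly decoded string with ISO-8859-1 gives back the original bytes unchanged.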

between servlet-spec 2.2 and 2.3 this issue occurred to the peeps doing 
the spec and they added setCharacterEncoding() to the servlet-request 
and mention explicitly that you need to call it before the first 
getParameter call (or any related action that requires parsing and thus 
decoding the query-string)


> But then in the bug report for Xalan (someone having this same problem) 
> it says:
> 
> "According to section 16.2 of the XSLT Recommendation [1], non-ASCII 
> characters in URI attribute values should be escaped using the method 
> recommended in Section B.2.1 of the HTML 4.0 Recommendation [2]. The 
> latter recommends that non-ASCII characters be represented in UTF-8 
> prior to applying the "%HH" escaping described by the URI RTF, 
> regardless of the output encoding."
> 

nifty, didn't know... so whatever output encoding you set the uri's will 
be utf-8 encoded, and then url-encoded?

haven't ever seen this, I was under the impression that to xalan 
attributes were just attributes and would have expected characters to be 
replaced by character-entity-refs depending on if they are supported or 
not by the applied output-encoding
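
To make the difference concrete, here is a small sketch comparing the two escapings of "äö". It uses java.net.URLEncoder as a stand-in; that class does form-style encoding (a space becomes "+", for instance), so it is only an approximation of what a serializer does for URI attributes, and the class name is made up:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class HrefEscapeSketch {

    // Percent-escapes s after converting it to bytes in the given charset.
    static String escape(String s, String charset) {
        try {
            return URLEncoder.encode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String value = "\u00e4\u00f6"; // "äö"

        // What the HTML 4 recommendation asks for (and what shows up in the href):
        System.out.println(escape(value, "UTF-8"));      // %C3%A4%C3%B6

        // What one would get if the escaping followed a latin-1 output encoding:
        System.out.println(escape(value, "ISO-8859-1")); // %E4%F6
    }
}
```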

> This is what Xalan does (HTML serialization), so it obeys the spec.
> 
> Correct me if I'm wrong, but during serialization if there are special 
> characters (above 128) in an URL:s request parameters (href-attributes 
> etc.), they are first encoded in UTF-8 by Xalan. Even if the browser 

apparently, would like to see some test evidence to be on the safe side 
though

> detects the page as ISO-8859-1 or anything else, these URL:s in the HTML 
> source contain parameters in UTF-8. Now, when user clicks on this link, 

but it is not about request-parameters is it?
it is about the proper URL part, no?

as in:

http://server:port/path/more-path?request-param=value
---------------------------------|-------------------
  >>  area-not-fixed-by-cocoon << |  >> area fixed by cocoon <<

(in fact I'm even doubting whether we are fixing the names of the 
request-params; my guess would be we're only doing the values)

see 
http://cvs.apache.org/viewcvs.cgi/cocoon/trunk/src/java/org/apache/cocoon/environment/http/HttpRequest.java?rev=55600&root=Apache-SVN&view=auto

there is the internal decode() method. it only gets called from places 
that deal with request-parameter-values (as I started to think: not even 
the names)
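
A rough simulation of that split, under the assumption that the container decodes both the path and the query string as ISO-8859-1 and that only the parameter value gets the re-decode fix afterwards. It is not taken from HttpRequest.java; the class, the methods and the example URL are hypothetical:

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class PathVersusParamSketch {

    // Mimics the container: undo the %HH escaping and read the bytes as ISO-8859-1.
    static String containerDecode(String raw) {
        try {
            return URLDecoder.decode(raw, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // The re-decode fix, applied to parameter values only.
    static String fix(String value) {
        return new String(value.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    // The path as seen after the container: decoded, but never fixed.
    static String pathAsSeen(URI uri) {
        return containerDecode(uri.getRawPath());
    }

    // The value of a single "name=value" query: decoded, then fixed.
    static String paramValueAsSeen(URI uri) {
        String rawValue = uri.getRawQuery().split("=", 2)[1];
        return fix(containerDecode(rawValue));
    }

    public static void main(String[] args) {
        // "äö" escaped as UTF-8 in both the path and the parameter value
        URI uri = URI.create("http://server:8080/path/%C3%A4%C3%B6?name=%C3%A4%C3%B6");

        System.out.println(pathAsSeen(uri).length());       // 10: "/path/" plus 4 mojibake chars
        System.out.println(paramValueAsSeen(uri).length()); // 2: the value is "äö" again
    }
}
```

If the suspicion holds, this is why a matcher working on the path would never see "äö", while getParameter on the value does.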

> Cocoon reads the request parameters in as ISO-8859-1, and converts them 
> to UTF-8, without knowing that these parameters were already UTF-8!
> 

nope, don't think so... first nuance (see above) the container reads
and applies (typically) ISO-8859-1,...

and cocoon correctly re-encodes request-parameter-values based on its 
'form-encoding', but isn't (at least to my knowledge) touching the url 
part of things


(sorry for the confusion, but that was exactly the executive summary of 
my previous post)


hope this clarifies the issue
hope this strengthens your trust in the proposed workarounds...

-marc=


> My knowledge of the Cocoon internals is not very good, but could this be 
> the problem?
> 
> -Tuomo
> 
> 
> On Fri, 29 Oct 2004, Marc Portier wrote:
> 
>> just scanning through this issue fast it seems to me like more 
>> evidence of things expressed here: 
>> http://marc.theaimsgroup.com/?t=109231177100007&r=1&w=2
>>
>>
>> rehashing what I read from Tuomo's setup:
>>
>> - cocoon-servlet init params are set to have container-encoding 
>> unchanged (thus iso_8859_1) like we recommend and form-encoding to 
>> utf-8 to make sure his forms can support wide variety of characters
>>
>> - as a consequence of this last setting (and the wellknown 
>> browser-limitation) this means we need to sync the encoding on the 
>> serializer to this same utf-8
>>
>> - because of this setting there is no reason to complain about the 
>> resulting HTML, that is full of utf-8 encoding, no need to refer to 
>> specs or blame cocoon: xml serialization was requested to use utf-8 so 
>> it does (even xalan does its work here I suppose)
>>
>>
>> now, what goes wrong?
>>
>> well, I had planned to get into this during the GT2004 hackathon but got 
>> distracted by other issues.  Lacking the experience of an in-depth 
>> debugging session I can't really do more than express my current 
>> 'suspicions'
>>
>> (as stated in the thread above)
>> we've done quite a good job at solving the issue regarding encodings 
>> of request-parameters and even extended the new servlet 2.3 insights 
>> in doing so (setCharacterEncoding()) to support even 2.2 containers
>>
>> however, one important part of the request object set of getters is 
>> escaping this: the URL (and some of its derived 'paths' as well I assume)
>>
>> This explains why encoding in form-request params gets fixed 
>> correctly, but the url itself remains broken --> consequence:
>> - you can't link to non-latin-char-urls but you can pass 
>> non-latin-request-params
>>
>> in more cocoon detail this means you can't expect cocoon matchers to 
>> get correctly triggered by non-latin-urls as well as you can't 
>> automount sitemaps in directories with non-latin-only-names...
>> (or read resources with non-latin-only-names as the original post of 
>> the other thread was about)
>>
>>
>>
>> Suggestion:
>> 1. do some tests to verify above and list them as known limitations on 
>> appropriate wikis. --> tell about the two workarounds:
>> a/ to avoid non-latin urls (even if w3c says all urls should be utf-8 
>> encoded)
>> b/ use jetty, set org.mortbay.util.URI.charset property and then DO 
>> change the cocoon 'container-encoding' param accordingly
>>
>> 2. (assuming my analysis is correct and gets confirmed by the tests) 
>> extend our http-wrapping-encoding-fix to include the urls and paths as 
>> well (using the tests as a way to verify the success of this)
>>
>> 3. start the crusade for the abolishment of all encodings but utf-8!
>>
>>
>> The time consuming part here is jamming together an easy deployable 
>> testsuite (zip with automount sitemap and all needed stuff inside) 
>> covering the various aspects... would be cool if somebody else could 
>> be doing that...
>>
>> regards,
>> -marc=
>>
>> Joerg Heinicke wrote:
>>
>>> On 29.10.2004 08:44, Tuomo L wrote:
>>>
>>>>>> We're having some serious encoding problems. This happens only 
>>>>>> with the @href attributes in html, when using characters like å, ä 
>>>>>> and ö (in Finnish alphabet). Form encoding works just fine. I've 
>>>>>> gone through all the threads concerning encoding (other people 
>>>>>> having encoding problems too). No luck so far. Is this still an 
>>>>>> issue in Cocoon? Could someone please tell what's wrong?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> What's the page encoding? Forms work like expected? Just the links 
>>>>> don't work? This normally points to a different page encoding than 
>>>>> UTF-8 as link requests are encoded in UTF-8 while form requests are 
>>>>> encoded in page encoding. I don't think it is a Cocoon issue.
>>>
>>>
>>>
>>> First a link about all the encodings:
>>> http://wiki.apache.org/cocoon/RequestParameterEncoding (mostly written
>>> by Bruno).
>>>
>>>> According to IE, the page encoding is set to UTF-8. The
>>>> container-encoding and form-encoding in web.xml (Tomcat) are set to 
>>>> UTF-8.
>>>
>>>
>>>
>>> The container-encoding should not be touched at all and remain 
>>> ISO-8859-1.
>>>
>>>> HTMLSerializer is set to use UTF-8 (mime-type="text/html; 
>>>> charset=utf-8")
>>>> and has the parameter <encoding>UTF-8</encoding>.
>>>
>>>
>>>
>>> This should result in <meta http-equiv="Content-Type"
>>> content="text/html;charset=utf-8">. The request encoding header should
>>> have the same value ... which is not that easy when using a recent Tomcat:
>>> http://issues.apache.org/bugzilla/show_bug.cgi?id=26997
>>>
>>>> The xsl stylesheets use ISO-8859-1, though.
>>>
>>>
>>>
>>> That's not a problem.
>>>
>>>> I've also tried setting everything to ISO-8859-1, but the problem
>>>> with the href-attributes in html remains. Mozilla Firefox shows the
>>>> characters correctly when doing "view source", but if I save the
>>>> document on disk and open with ASCII-editor, the encoding is wrong
>>>> there with both IE and Mozilla. So maybe it's not a browser problem?
>>>>
>>>> Here's an example:
>>>>
>>>> <a href="äö" foo="äö">äö</a>
>>>>
>>>> becomes:
>>>>
>>>> <a href="%C3%A4%C3%B6" foo="&auml;&ouml;">&auml;&ouml;</a>
>>>>
>>>> when it should read (I think):
>>>>
>>>> <a href="&auml;&ouml;" foo="&auml;&ouml;">&auml;&ouml;</a>
>>>
>>>
>>>
>>> ...
>>> follow-up mail:
>>>
>>>> The URL-encoding is done wrong when serializing to HTML. According to
>>>> specs "äö" should become "%E4%F6" when encoded, not "%C3%A4%C3%B6".
>>>> This seems to be the problem. So far I've noticed this problem with
>>>> the HREF-attribute only.
>>>>
>>>> For a test I made a stylesheet that substitutes "ä" with "%E4"
>>>> before serializing to HTML. This works, but it should be done by the
>>>> serializer, right?
>>>>
>>>> Seems like a Cocoon issue.
>>>
>>>
>>>
>>> If it would be an error at all, it would be a Xalan serializer problem I
>>> think. But there were bugs reported on this topic and rejected because
>>> of the specs (I think they have the same problems like you):
>>>
>>> http://nagoya.apache.org/jira/browse/XALANJ-1412
>>> http://nagoya.apache.org/jira/browse/XALANJ-1548
>>>
>>> As I wrote: you simply get different request encodings when sending a
>>> form or just clicking <a href=""/>.
>>>
>>> Joerg
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
>>> For additional commands, e-mail: users-help@cocoon.apache.org
>>>
>>
>> -- 
>> Marc Portier                            http://outerthought.org/
>> Outerthought - Open Source, Java & XML Competence Support Center
>> Read my weblog at                http://blogs.cocoondev.org/mpo/
>> mpo@outerthought.org                              mpo@apache.org
>>
> 

-- 
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                http://blogs.cocoondev.org/mpo/
mpo@outerthought.org                              mpo@apache.org
