cocoon-dev mailing list archives

From Jeremy Quinn
Subject Re: request encoding conundrum
Date Thu, 24 Jul 2008 09:56:58 GMT

On 24 Jul 2008, at 09:06, Grzegorz Kossakowski wrote:

> Jeremy Quinn wrote:
>> Hi All
> Hi Jeremy! :-)

Hi Grzegorz, nice to hear from you :)

>> I am trying to solve a nasty request transcoding bug, that I found  
>> while working on CForms.
> Join the club! Discovered character encoding problems two days ago  
> in a project based on Cocoon 2.1.x. Tried to fight it yesterday and  
> gave up.

You work with 2.1 ?? I am shocked :)

>> AFAICS this bug affects older versions as well ..... accented  
>> characters not roundtripping due to bad transcoding in Cocoon,  
>> under certain circumstances.
>> CForms works in one of two modes: ajax-on and ajax-off.
>> When ajax is on, CForms submits the form via an XMLHttp Request  
>> (XHR), when it is off it submits the form normally.
>> Servlet Requests are expected by default to be encoded using  
>> ISO-8859-1 (appalling choice!!!), but of course to get any real  
>> work done on the international web, you should use UTF-8 (now  
>> Cocoon's default, thanks to Vadim).
> When I was looking at our code in HttpEnvironment, HttpRequest and  
> in MultipartParser I started to wonder if it would be an option to  
> forget about any other encodings apart from UTF-8. According to my  
> knowledge, there is no serious software that does not support Unicode.
> This would help us to clean up and simplify the code in trunk  
> greatly so it would go into 2.3 release (don't be afraid, you won't  
> need to wait for it years, I promise).
> The only problem is that I don't have any significant experience  
> with such issues so I would like to hear if my proposal makes sense.  
> Would it be possible to support Unicode only?

A change like this, while simplifying our codebase, could cause utter  
havoc for users ..... I don't know if Unicode really is a practical  
superset of every other possible encoding.

Sorry, I do not think I know enough about this either.

>> Browsers should post data in the encoding of the page containing  
>> the form.
>> Dojo always posts forms as UTF-8 when it does XHR, seemingly  
>> regardless of the page encoding. Furthermore, the post has a  
>> Content-Type header : "application/x-www-form-urlencoded;  
>> charset=UTF-8". (Default in FireFox3, can be set in Safari, unknown  
>> in MSIE).
>> Jetty responds properly to the Content-Type header, by  
>> automatically using that charset for decoding Request Parameters  
>> instead of the default ISO-8859-1. (behaviour of other  
>> ServletEngines unknown). This leads to a transcoding bug because  
>> Cocoon assumes ISO-8859-1.
> I think that behaviour of Jetty is correct. Right?

It /seems/ right ....
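
To make the failure concrete, here is a minimal standalone sketch (not Cocoon code; the class name TranscodeDemo and its decode helper are made up for illustration). It only mirrors the getBytes / new String pair used by HttpRequest.decode, and it assumes the browser posts UTF-8 bytes in both cases:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Cocoon.
final class TranscodeDemo {

    // Same idea as HttpRequest.decode: undo the container's decoding,
    // then decode the raw bytes again with the form encoding.
    static String decode(String value, Charset container, Charset form) {
        return new String(value.getBytes(container), form);
    }

    public static void main(String[] args) {
        String typed = "caf\u00e9"; // what the user typed
        byte[] wire = typed.getBytes(StandardCharsets.UTF_8); // bytes on the wire

        // Case 1: no charset in Content-Type, so the container falls back
        // to ISO-8859-1. Transcoding repairs the value.
        String seenByCocoon = new String(wire, StandardCharsets.ISO_8859_1);
        String repaired = decode(seenByCocoon,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(repaired.equals(typed));

        // Case 2: the container honoured charset=UTF-8, so the value is
        // already correct. Transcoding it anyway damages the accented character.
        String alreadyCorrect = new String(wire, StandardCharsets.UTF_8);
        String damaged = decode(alreadyCorrect,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(damaged.equals(typed));
    }
}
```

Case 1 is the ajax-off situation and round-trips; case 2 is the ajax-on situation with Jetty, where the second decode is applied to a string that was never ISO-8859-1 in the first place.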

>> When forms are submitted normally (i.e. non-XHR) they usually do  
>> not contain the Content-Type header (tested with FireFox3 & Safari)  
>> and it does not seem possible to set one from JavaScript (XHR has  
>> the api to do it).
>> So unless the user has set a different encoding for the  
>> serialisation of their forms, CForms Requests will always be in  
>> UTF-8, but the Content-Type header will not always specify this.
>> If the Content-Type header contains a charset, (at least in Jetty)  
>> no further transcoding should happen. If it does not contain a  
>> charset, the encoding will be default and parameters must be  
>> transcoded.
>> So, if the header is correctly set, Cocoon's transcoding hack  
>> (o.a.c.environment.http.HttpRequest.decode) breaks, because it  
>> assumes standard ISO-8859-1.
>> Therefore we face the situation where it is impossible to get  
>> correct decoding via settings in web.xml : "container-encoding" and  
>> "form-encoding"
>> that work for both ajax-on and ajax-off forms from the same  
>> instance of Cocoon.
>> But I have a solution I think :)
>> I propose that the default settings in Cocoon's web.xml for  
>> "container-encoding" and "form-encoding" should be :
>> container-encoding : ISO-8859-1
>>    - meaning: my servlet container uses this as its default encoding
>>      (unless some modern browser tells it different)
>> form-encoding : UTF-8
>>    - meaning: this is Cocoon's default encoding for forms
>> Make this change to o.a.c.environment.http.HttpEnvironment's  
>> constructor :
>> change :
>> this.request.setCharacterEncoding(defaultFormEncoding);
>> this.request.setContainerEncoding(containerEncoding);
>> to:
>> if (req.getCharacterEncoding() == null) {
>>    // use the value from web.xml
>>    this.request.setContainerEncoding(
>>        containerEncoding != null ? containerEncoding : "ISO-8859-1");
>> } else {
>>    // use what we have been given
>>    this.request.setContainerEncoding(req.getCharacterEncoding());
>> }
>> this.request.setCharacterEncoding(
>>    defaultFormEncoding != null ? defaultFormEncoding : "UTF-8");
>> Then cleanup o.a.c.environment.http.HttpRequest methods :
>> public String getParameter(String name) {
>>    String value = this.req.getParameter(name);
>>    if (!this.container_encoding.equals(this.form_encoding)) {
>>        value = decode(value);
>>    }
>>    return value;
>> }
>> private String decode(String str) {
>>    if (str == null) return null;
>>    try {
>>        byte[] bytes = str.getBytes(this.container_encoding);
>>        return new String(bytes, this.form_encoding);
>>    } catch (UnsupportedEncodingException uee) {
>>        throw new CascadingRuntimeException(
>>            "Unsupported Encoding Exception", uee);
>>    }
>> }
>> public String[] getParameterValues(String name) {
>>    String[] values = this.req.getParameterValues(name);
>>    if (values == null) return null;
>>    if (this.container_encoding.equals(this.form_encoding)) {
>>        return values;
>>    }
>>    String[] decoded_values = new String[values.length];
>>    for (int i = 0; i < values.length; ++i) {
>>        decoded_values[i] = decode(values[i]);
>>    }
>>    return decoded_values;
>> }
>> So we only guess at the encoding if we really don't know what it is.
>> My understanding is that Tomcat also returns null for  
>> getCharacterEncoding() if the default encoding is being used, but I  
>> do not know if it responds properly to a Content-Type header with a  
>> charset in it.
>> My guess is that either browsers sending proper Content-Type (with  
>> a charset) and/or ServletEngines responding properly to it, must be  
>> a relatively recent development.
>> This is not tested outside of :
>>    MacOSX, FireFox3, Safari, Jetty
>> If you have got this far, and would be willing to test this in  
>> other environments, it would be most helpful.
> The code responsible for all these conversions is really old,  
> so I guess we will need to check it again.
> Before I start to test your proposal I'll add a little bit of  
> complexity to your picture. You seem to have forgotten about other  
> data encodings like multipart/form-data. If you enable it by setting:
>  <form enctype="multipart/form-data" ...>
> then the browser will encode form data using a completely different  
> method. As you can probably guess, problems occur there as well.

Yes, I was expecting that.
Upgrading CForms upload widget is on my long list ..... I guess you  
just bumped it forward a few places :)

There is also maybe work to do in the portal .... Carsten? ;)

> Our own problem with multipart/form-data is that file names of  
> uploaded files are not correctly decoded. You can easily check it  
> using the following sample in Cocoon:
> (try this sample with ajax mode on and off and with non-Latin  
> characters both in the file name and in a text field)
> There is even a bug report about this issue:
> Another interesting option would be to replace our own handling of  
> multipart requests with commons-upload code, see:
> What do you think about the last proposal?

I need a bit of time to dig into this .....

> Now I'm going to test the fix you proposed...

Many thanks!
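
In case it helps with testing, the behaviour I am aiming for can be sketched outside of Cocoon like this. It is only an illustration: the class EncodingChoice and its method names are invented here, and the case-insensitive comparison is my assumption (the patch quoted above uses equals()).

```java
// Hypothetical helper, not part of Cocoon: captures the decision the
// proposed HttpEnvironment change is meant to make.
final class EncodingChoice {
    final String containerEncoding;
    final String formEncoding;

    private EncodingChoice(String containerEncoding, String formEncoding) {
        this.containerEncoding = containerEncoding;
        this.formEncoding = formEncoding;
    }

    // requestEncoding: result of getCharacterEncoding(), null if the request
    //                  did not declare a charset
    // configuredContainer / configuredForm: values from web.xml, may be null
    static EncodingChoice choose(String requestEncoding,
                                 String configuredContainer,
                                 String configuredForm) {
        String container;
        if (requestEncoding != null) {
            container = requestEncoding; // use what we have been given
        } else if (configuredContainer != null) {
            container = configuredContainer; // use the value from web.xml
        } else {
            container = "ISO-8859-1"; // servlet default
        }
        String form = configuredForm != null ? configuredForm : "UTF-8";
        return new EncodingChoice(container, form);
    }

    // Assumption: compare case-insensitively, since charset names such as
    // "utf-8" and "UTF-8" refer to the same encoding.
    boolean needsTranscoding() {
        return !containerEncoding.equalsIgnoreCase(formEncoding);
    }

    public static void main(String[] args) {
        // ajax-off: no charset declared, so parameters must be transcoded
        System.out.println(choose(null, "ISO-8859-1", "UTF-8").needsTranscoding());
        // ajax-on: charset=UTF-8 declared, so parameters are left alone
        System.out.println(choose("UTF-8", "ISO-8859-1", "UTF-8").needsTranscoding());
    }
}
```

The two calls in main correspond to the ajax-off and ajax-on submissions described earlier in the thread.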

regards Jeremy
