cocoon-dev mailing list archives

From: Grzegorz Kossakowski <g...@tuffmail.com>
Subject: Re: request encoding conundrum
Date: Thu, 24 Jul 2008 08:06:37 GMT
Jeremy Quinn wrote:
> Hi All

Hi Jeremy! :-)

> I am trying to solve a nasty request transcoding bug, that I found while 
> working on CForms.

Join the club! Discovered character encoding problems two days ago in a project based on Cocoon 2.1.x. Tried to fight it yesterday and gave up.

> AFAICS this bug affects older versions as well ..... accented characters 
> not roundtripping due to bad transcoding in Cocoon, under certain 
> circumstances.
> 
> CForms works in one of two modes: ajax-on and ajax-off.
> When ajax is on, CForms submits the form via an XMLHttp Request (XHR), 
> when it is off it submits the form normally.
> 
> Servlet Requests are expected by default to be encoded using ISO-8859-1 
> (appalling choice!!!), but of course to get any real work done on the 
> international web, you should use UTF-8 (now Cocoon's default, thanks to 
> Vadim).

When I was looking at our code in HttpEnvironment, HttpRequest and MultipartParser, I started to wonder whether it would be an option to drop every encoding other than UTF-8. As far as I know, there is no serious software that does not support Unicode.

This would help us clean up and greatly simplify the code in trunk, so it would go into the 2.3 release (don't be afraid, you won't need to wait years for it, I promise).

The only problem is that I don't have any significant experience with such issues, so I would like to hear whether my proposal makes sense. Would it be possible to support Unicode only?

> Browsers should post data in the encoding of the page containing the form.
> 
> Dojo always posts forms as UTF-8 when it does XHR, seemingly regardless 
> of the page encoding. Furthermore, the post has a Content-Type header : 
> "application/x-www-form-urlencoded; charset=UTF-8". (Default in 
> FireFox3, can be set in Safari, unknown in MSIE).
> 
> Jetty responds properly to the Content-Type header, by automatically 
> using that charset for decoding Request Parameters instead of the 
> default ISO-8859-1. (behaviour of other ServletEngines unknown). This 
> leads to a transcoding bug because Cocoon assumes ISO-8859-1.

I think Jetty's behaviour is correct. Right?
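
To make the failure mode concrete, here is a minimal standalone sketch of what happens when the ISO-8859-1 to UTF-8 transcoding step is applied in both situations. It is not Cocoon code: the class `DoubleDecodeDemo` and the method `transcode` are made up for illustration, and the "container" is simulated by decoding the wire bytes by hand.

```java
import java.nio.charset.StandardCharsets;

class DoubleDecodeDemo {

    // Mimics the transcoding hack: assume the container decoded the bytes as
    // ISO-8859-1, recover those bytes, and decode them again as UTF-8.
    static String transcode(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String typed = "caf\u00e9";                          // what the user entered
        byte[] wire = typed.getBytes(StandardCharsets.UTF_8); // what the browser sends

        // Case 1: no charset in Content-Type, container falls back to ISO-8859-1.
        // The value arrives mangled and the transcoding step repairs it.
        String misdecoded = new String(wire, StandardCharsets.ISO_8859_1);
        System.out.println(transcode(misdecoded).equals(typed));     // true

        // Case 2: container honours charset=UTF-8, so the value is already correct.
        // Transcoding it anyway destroys the accented character.
        String alreadyCorrect = new String(wire, StandardCharsets.UTF_8);
        System.out.println(transcode(alreadyCorrect).equals(typed)); // false
    }
}
```

The second case is the ajax-on path described above: the transcoding is only safe when the container really did fall back to ISO-8859-1.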

> When forms are submitted normally (i.e. non-XHR) they usually do not 
> contain the Content-Type header (tested with FireFox3 & Safari) and it 
> does not seem possible to set one from JavaScript (XHR has the api to do 
> it).
> 
> So unless the user has set a different encoding for the serialisation of 
> their forms, CForms Requests will always be in UTF-8, but the 
> Content-Type header will not always specify this.
> 
> If the Content-Type header contains a charset, (at least in Jetty) no 
> further transcoding should happen. If it does not contain a charset, the 
> encoding will be default and parameters must be transcoded.
> 
> So, if the header is correctly set, Cocoon's transcoding hack 
> (o.a.c.environment.http.HttpRequest.decode) breaks, because it assumes 
> standard ISO-8859-1.
> 
> Therefore we face the situation where it is impossible to get correct 
> decoding via settings in web.xml : "container-encoding" and "form-encoding"
> that work for both ajax-on and ajax-off forms from the same instance of 
> Cocoon.
> 
> But I have a solution I think :)
> 
> I propose that the default settings in Cocoon's web.xml for 
> "container-encoding" and "form-encoding" should be :
> container-encoding : ISO-8859-1
>     - meaning: my servlet container uses this as its default encoding
>       (unless some modern browser tells it different)
> form-encoding : UTF-8
>     - meaning: this is Cocoon's default encoding for forms
> 
> Make this change to o.a.c.environment.http.HttpEnvironment's constructor :
> change :
> this.request.setCharacterEncoding(defaultFormEncoding);
> this.request.setContainerEncoding(containerEncoding);
> 
> to:
> if (req.getCharacterEncoding() == null) { // use the value from web.xml
>     this.request.setContainerEncoding(containerEncoding != null ? 
> containerEncoding : "ISO-8859-1");
> } else { // use what we have been given
>     this.request.setContainerEncoding(req.getCharacterEncoding());
> }
> this.request.setCharacterEncoding(defaultFormEncoding != null ? 
> defaultFormEncoding : "UTF-8");
> 
> Then cleanup o.a.c.environment.http.HttpRequest methods :
> 
> public String getParameter(String name) {
>     String value = this.req.getParameter(name);
>     if (!this.container_encoding.equals(this.form_encoding)) {
>         value = decode(value);
>     }
>     return value;
> }
> 
> private String decode(String str) {
>     if (str == null) return null;
>     try {
>         byte[] bytes = str.getBytes(this.container_encoding);
>         return new String(bytes, this.form_encoding);
>     } catch (UnsupportedEncodingException uee) {
>         throw new CascadingRuntimeException("Unsupported Encoding Exception", uee);
>     }
> }
> 
> public String[] getParameterValues(String name) {
>     String[] values = this.req.getParameterValues(name);
>     if (values == null) return null;
>     if (this.container_encoding.equals(this.form_encoding)) {
>         return values;
>     }
>     String[] decoded_values = new String[values.length];
>     for (int i = 0; i < values.length; ++i) {
>         decoded_values[i] = decode(values[i]);
>     }
>     return decoded_values;
> }
> 
> So we only guess at the encoding, if we really don't know what it is.
> 
> My understanding is that TomCat also returns null for 
> getCharacterEncoding() if the default encoding is being used, but I do 
> not know if it responds properly to a Content-Type header with a charset 
> in it.
> 
> My guess is that either browsers sending proper Content-Type (with a 
> charset) and/or ServletEngines responding properly to it, must be a 
> relatively recent development.
> 
> This is not tested outside of :
>     MacOSX, FireFox3, Safari, Jetty
> 
> If you have got this far, and would be willing to test this in other 
> environments, it would be most helpful.

The code responsible for all these conversions is really old, so I guess we will need to check it again.
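
The decision Jeremy describes (trust the request's own charset when it has one, otherwise fall back to the configured default, and transcode only when the container and form encodings differ) can be sketched in isolation. This is a standalone illustration under those assumptions, not the actual HttpEnvironment/HttpRequest code; the class `EncodingGuard` and its method names are hypothetical. One deliberate difference from the quoted snippet: it compares `Charset` objects rather than strings, so "utf-8" and "UTF-8" count as the same encoding.

```java
import java.nio.charset.Charset;

class EncodingGuard {

    // Prefer the charset the request announced; otherwise use the configured
    // container encoding, and ISO-8859-1 if nothing is configured.
    static String containerEncoding(String requestEncoding, String configured) {
        if (requestEncoding != null) {
            return requestEncoding;
        }
        return configured != null ? configured : "ISO-8859-1";
    }

    // Transcode only when the container decoded with a different charset
    // than the one the form was actually written in.
    static String decode(String value, String containerEncoding, String formEncoding) {
        if (value == null) {
            return null;
        }
        Charset container = Charset.forName(containerEncoding);
        Charset form = Charset.forName(formEncoding);
        if (container.equals(form)) {
            return value; // already decoded correctly, leave it alone
        }
        return new String(value.getBytes(container), form);
    }

    public static void main(String[] args) {
        // Non-XHR submit: no charset announced, value arrives mis-decoded.
        String fallback = containerEncoding(null, null);
        System.out.println(decode("caf\u00c3\u00a9", fallback, "UTF-8"));

        // XHR submit: charset=UTF-8 announced, value arrives intact.
        String announced = containerEncoding("utf-8", "ISO-8859-1");
        System.out.println(decode("caf\u00e9", announced, "UTF-8"));
    }
}
```

Both calls in `main` should yield the same correctly accented word, which is the behaviour the proposal is after.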

Before I start testing your proposal, I'll add a little complexity to your picture. You seem to have forgotten about other data encodings like multipart/form-data. You enable it by setting:

   <form enctype="multipart/form-data" ...>

Then the browser will encode the form data using a completely different method. As you can probably guess, problems occur there as well.

Our own problem with multipart/form-data is that the file names of uploaded files are not decoded correctly. You can easily check this using the following sample in Cocoon:
http://cocoon.zones.apache.org/demos/trunk/samples/forms/upload

(try this sample with ajax mode on and off, and with non-Latin characters both in the file name and in a text field)

There is even a bug report about this issue:
https://issues.apache.org/jira/browse/COCOON-1917
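
As a rough illustration of why file names are fragile here: in a multipart body the file name travels inside the part's Content-Disposition header as raw bytes, so whichever charset the parser uses to read that header decides what the name looks like. The sketch below is not the MultipartParser code; the class `MultipartFilenameDemo`, the field name "upload" and the sample file name are invented, and it assumes the browser sent the header bytes as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultipartFilenameDemo {

    private static final Pattern FILENAME = Pattern.compile("filename=\"([^\"]*)\"");

    // Decode the raw part header with the given charset and pull out the
    // filename parameter; returns null if the header has none.
    static String filename(byte[] partHeader, Charset charset) {
        Matcher m = FILENAME.matcher(new String(partHeader, charset));
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String header = "Content-Disposition: form-data; name=\"upload\"; "
                + "filename=\"za\u017c\u00f3\u0142\u0107.txt\"";
        byte[] raw = header.getBytes(StandardCharsets.UTF_8); // bytes as sent by the browser

        // Reading the header as UTF-8 recovers the original name.
        System.out.println(filename(raw, StandardCharsets.UTF_8));

        // Reading it as ISO-8859-1 turns every non-ASCII letter into two wrong characters.
        System.out.println(filename(raw, StandardCharsets.ISO_8859_1));
    }
}
```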

Another interesting option would be to replace our own handling of multipart requests with commons-upload code, see:
https://issues.apache.org/jira/browse/COCOON-1325

What do you think about the last proposal?


Now I'm going to test the fix you proposed...

-- 
Best regards,
Grzegorz Kossakowski
