cocoon-dev mailing list archives

From Jeremy Quinn <jer...@apache.org>
Subject request encoding conundrum
Date Wed, 23 Jul 2008 21:18:27 GMT
Hi All

I am trying to solve a nasty request transcoding bug that I found
while working on CForms.

AFAICS this bug affects older versions as well: accented
characters fail to round-trip because of bad transcoding in Cocoon,
under certain circumstances.

CForms works in one of two modes: ajax-on and ajax-off.
When ajax is on, CForms submits the form via an XMLHttpRequest (XHR);
when it is off, it submits the form normally.

Servlet Requests are expected by default to be encoded using  
ISO-8859-1 (appalling choice!!!), but of course to get any real work  
done on the international web, you should use UTF-8 (now Cocoon's  
default, thanks to Vadim).

Browsers should post data in the encoding of the page containing the  
form.

Dojo always posts forms as UTF-8 when it does XHR, seemingly
regardless of the page encoding. Furthermore, the post has a
Content-Type header: "application/x-www-form-urlencoded; charset=UTF-8"
(the default in Firefox 3; can be set in Safari; unknown in MSIE).

Jetty responds properly to the Content-Type header, by automatically  
using that charset for decoding Request Parameters instead of the  
default ISO-8859-1. (behaviour of other ServletEngines unknown). This  
leads to a transcoding bug because Cocoon assumes ISO-8859-1.
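
To make concrete what "responding to the Content-Type header" involves, here is a minimal sketch of pulling the charset parameter out of the header value. This is not Jetty's actual code; the class and method names are made up for illustration, and real containers handle more edge cases than this does.

```java
import java.util.Locale;

class ContentTypeCharset {
    /**
     * Returns the value of the charset parameter in a Content-Type
     * header value, or null if the header carries no charset.
     * Hypothetical helper, for illustration only.
     */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        // parts[0] is the media type; parameters follow, separated by ';'
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // XHR post from Dojo: charset present
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=UTF-8"));
        // Normal form post: no charset, so the container falls back to its default
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```

A null result here corresponds to the case where getCharacterEncoding() on the request returns null.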

When forms are submitted normally (i.e. non-XHR) they usually do not
contain the Content-Type header (tested with Firefox 3 & Safari) and it
does not seem possible to set one from JavaScript (XHR has the API to
do it).

So unless the user has set a different encoding for the serialisation  
of their forms, CForms Requests will always be in UTF-8, but the  
Content-Type header will not always specify this.

If the Content-Type header contains a charset, (at least in Jetty) no
further transcoding should happen. If it does not contain a charset,
the container's default encoding will have been used and the
parameters must be transcoded.

So, if the header is correctly set, Cocoon's transcoding hack  
(o.a.c.environment.http.HttpRequest.decode) breaks, because it assumes  
standard ISO-8859-1.
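
The two cases can be reproduced outside a servlet container. The sketch below is only an illustration: the class and method names are made up, transcode() imitates the getBytes/new String step of the decode hack, and it uses java.nio.charset.StandardCharsets, so it needs a newer JDK than Cocoon currently targets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeDemo {
    /**
     * Re-reads the bytes of 'value' (as encoded in 'container')
     * using the 'form' charset. Hypothetical stand-in for the
     * transcoding step described above.
     */
    static String transcode(String value, Charset container, Charset form) {
        return new String(value.getBytes(container), form);
    }

    public static void main(String[] args) {
        String typed = "caf\u00e9"; // what the user typed, with an accented e
        byte[] wire = typed.getBytes(StandardCharsets.UTF_8); // what the browser sends

        // Case 1: no charset in the header, container decodes as ISO-8859-1.
        // The parameter arrives mangled and transcoding repairs it.
        String misdecoded = new String(wire, StandardCharsets.ISO_8859_1);
        String recovered = transcode(misdecoded,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("case 1 round-trips: " + typed.equals(recovered));

        // Case 2: header says charset=UTF-8 and the container honours it.
        // The parameter is already correct; transcoding it again damages it.
        String alreadyCorrect = new String(wire, StandardCharsets.UTF_8);
        String broken = transcode(alreadyCorrect,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("case 2 round-trips: " + typed.equals(broken));
    }
}
```

Case 1 round-trips and case 2 does not, which is the ajax-off versus ajax-on difference described above.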

Therefore we face a situation where no combination of the
"container-encoding" and "form-encoding" settings in web.xml gives
correct decoding for both ajax-on and ajax-off forms from the same
instance of Cocoon.

But I have a solution I think :)

I propose that the default settings in Cocoon's web.xml for
"container-encoding" and "form-encoding" should be:
container-encoding : ISO-8859-1
     - meaning: my servlet container uses this as its default encoding
       (unless some modern browser tells it otherwise)
form-encoding : UTF-8
     - meaning: this is Cocoon's default encoding for forms

Make this change to o.a.c.environment.http.HttpEnvironment's  
constructor :
change :
this.request.setCharacterEncoding(defaultFormEncoding);
this.request.setContainerEncoding(containerEncoding);

to:
if (req.getCharacterEncoding() == null) { // use the value from web.xml
     this.request.setContainerEncoding(
             containerEncoding != null ? containerEncoding : "ISO-8859-1");
} else { // use what we have been given
     this.request.setContainerEncoding(req.getCharacterEncoding());
}
this.request.setCharacterEncoding(
        defaultFormEncoding != null ? defaultFormEncoding : "UTF-8");
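
For anyone who wants to check the selection rule without a running container, the same decision can be written as two pure functions. This is a sketch only; the class and method names are hypothetical and are not part of Cocoon.

```java
class EncodingChoice {
    /**
     * Picks the encoding the container is assumed to have used.
     * requestEncoding: what req.getCharacterEncoding() returned (may be null)
     * configured:      the "container-encoding" value from web.xml (may be null)
     */
    static String containerEncoding(String requestEncoding, String configured) {
        if (requestEncoding != null) {
            return requestEncoding; // the request told us, so trust it
        }
        return configured != null ? configured : "ISO-8859-1";
    }

    /** Picks the form encoding, defaulting to UTF-8 when web.xml is silent. */
    static String formEncoding(String configured) {
        return configured != null ? configured : "UTF-8";
    }

    public static void main(String[] args) {
        // ajax-on: header carried charset=UTF-8, so both sides agree
        System.out.println(containerEncoding("UTF-8", "ISO-8859-1") + " / " + formEncoding(null));
        // ajax-off: no charset in the header, fall back to the web.xml value
        System.out.println(containerEncoding(null, "ISO-8859-1") + " / " + formEncoding(null));
    }
}
```

When the two results are equal no transcoding is needed; when they differ the parameters get transcoded.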

Then clean up o.a.c.environment.http.HttpRequest methods :

public String getParameter(String name) {
     String value = this.req.getParameter(name);
     if (!this.container_encoding.equals(this.form_encoding)) {
         value = decode(value);
     }
     return value;
}

private String decode(String str) {
     if (str == null) return null;
     try {
         byte[] bytes = str.getBytes(this.container_encoding);
         return new String(bytes, this.form_encoding);
     } catch (UnsupportedEncodingException uee) {
         throw new CascadingRuntimeException(
                 "Unsupported Encoding Exception", uee);
     }
}

public String[] getParameterValues(String name) {
     String[] values = this.req.getParameterValues(name);
     if (values == null) return null;
     if (this.container_encoding.equals(this.form_encoding)) {
         return values;
     }
     String[] decoded_values = new String[values.length];
     for (int i = 0; i < values.length; ++i) {
         decoded_values[i] = decode(values[i]);
     }
     return decoded_values;
}

So we only guess at the encoding if we really don't know what it is.
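
One thing to watch in the comparison: String.equals is case-sensitive, so a web.xml value of "utf-8" and a request reporting "UTF-8" would still trigger a transcode. A possible way around this, offered only as a sketch with a made-up class name, is to compare java.nio.charset.Charset objects instead of raw names.

```java
import java.nio.charset.Charset;

class EncodingCompare {
    /**
     * Returns true when the two names refer to different charsets.
     * Charset.forName resolves case differences and aliases, and
     * throws an unchecked exception for illegal or unsupported names.
     */
    static boolean needsTranscoding(String containerEncoding, String formEncoding) {
        Charset container = Charset.forName(containerEncoding);
        Charset form = Charset.forName(formEncoding);
        return !container.equals(form);
    }

    public static void main(String[] args) {
        // Same charset, different spelling
        System.out.println(needsTranscoding("utf-8", "UTF-8"));
        // Genuinely different charsets
        System.out.println(needsTranscoding("ISO-8859-1", "UTF-8"));
    }
}
```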

My understanding is that Tomcat also returns null for
getCharacterEncoding() if the default encoding is being used, but I do
not know if it responds properly to a Content-Type header with a
charset in it.

My guess is that browsers sending a proper Content-Type (with a
charset) and/or ServletEngines responding properly to it must be a
relatively recent development.

This is not tested outside of:
	Mac OS X, Firefox 3, Safari, Jetty

If you have got this far, and would be willing to test this in other  
environments, it would be most helpful.


best regards

Jeremy


