cocoon-dev mailing list archives

From Pier Fumagalli <p...@betaversion.org>
Subject Re: [Help]How can I use non-ascii file name?
Date Wed, 18 Aug 2004 08:43:06 GMT
On 17 Aug 2004, at 16:20, Marc Portier wrote:

>> How about setting it up as the default behavior for Cocoon's internal  
>> Jetty distro?
>
> makes sense, but: (whishing all this brokenness wan't there but helas)

It's not really "brokenness" but more along the lines of an inversion  
of the Robustness Principle, as outlined by J. Postel in RFC-791  
(http://www.rfc-editor.org/rfc/rfc791.txt section 3.2) and later  
dogmatized by R. Braden in RFC-1122  
(http://www.rfc-editor.org/rfc/rfc1122.txt Section 1.2.2).

"Be liberal in what you accept, and conservative in what you send."

In this case browsers are liberal in what they send (URL-Encoded UTF-8)  
and servlet containers are conservative in what they accept  
(URL-Encoded ISO-8859-1).
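
A minimal sketch of that mismatch, using only java.net.URLEncoder and java.net.URLDecoder (Java 10 or later for the Charset overloads). The class and method names are made up for illustration; this is not what any particular container does internally, just the same bytes read under two different charsets:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UrlCharsetMismatch {

    // Percent-encode a name the way a browser sending URL-encoded UTF-8 would.
    static String encodeUtf8(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Percent-decode the received string under the charset the receiver assumes.
    static String decode(String s, Charset assumed) {
        return URLDecoder.decode(s, assumed);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the final e.
        String sent = encodeUtf8("caf\u00e9");
        System.out.println(sent); // caf%C3%A9

        // Decoded as UTF-8, the original name comes back.
        System.out.println(decode(sent, StandardCharsets.UTF_8));

        // Decoded as ISO-8859-1, each UTF-8 byte becomes its own character,
        // so the accented letter turns into two characters (U+00C3, U+00A9).
        System.out.println(decode(sent, StandardCharsets.ISO_8859_1));
    }
}
```

The second decode is what a non-ASCII file name looks like when the container assumes ISO-8859-1 for the URL.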

> - it shouldn't keep us from actually get about solving it for all
> containers? (my guess is that just a fraction of cocoon deployments
> actually run on the internal jetty distro, i.e. using the cocoon.sh or
> .bat?)

Well, we found that Jetty in production was much better than anything  
else. So, in our production environment we have Jetty (not the Cocoon  
distro one, a full blown copy)... Works pretty neatly! :-P

> - learning about this org.mortbay.util.URI.charset property we should
> probably use it to override (or at least log-warn deployers if it's
> different to) the container-encoding setting in the web.xml
> (assuming that the mentioned property will also be in effect when
> decoding the request parameters, and taking in account that current
> cocoon code assumes ISO-8859-1 as the default there)

I agree, but as I said, my world revolves around the best container in  
the world (whoops, Jetty), so I already have "my" fix to the problem:  
switch! :-P

> - once we've run that far, we might even consider making a scan of  
> other
> servlet containers and how they possibly allow setting the
> container-encoding?

The "container-encoding" servlet initialization parameter simply  
applies to request parameters (form data), and I suppose it only  
affects the way in which we read full blown characters from  
ServletRequest.getInputStream() and parse forms.

> while typing I started rethinking why we ended up with this
> container-encoding init-param in web.xml?
>
> IIRC we did that because of required compliance to servlet spec  
> versions
> prior to 2.3?  So first question is are we still on servlet 2.2?
>
> If not: Since 2.3 there exists a setCharacterEncoding()
> <quote from="servlet 2.3 javadoc"
> href="http://java.sun.com/products/servlet/2.3/javadoc/javax/servlet/ 
> ServletRequest.html#setCharacterEncoding(java.lang.String)">
>   Overrides the name of the character encoding used in the body of this
>   request. This method must be called prior to reading request
>   parameters or reading input using getReader().
> </quote>

Indeed, the problem here is that it's nowhere specified how the request  
BODY (not the URL, source of this problem) should be encoded.

Normally, from browser behaviour, I can see that browsers tend to post  
application/x-www-form-urlencoded data in the same charset they used  
when interpreting the form. So given an HTTP exchange like this:

C: GET /myForm HTTP/1.1
C: Host: localhost:80
C:
S: HTTP/1.1 200 OK
S: Date: Wed, 18 Aug 2004 08:30:28 GMT
S: Server: Apache/2.0.49 (Unix) DAV/2 SVN/1.0.2
S: Content-Type: text/html; charset=utf-8

When the form included in /myForm is posted back to its action, the  
UTF-8 charset will be used to encode the form data...
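
As a rough sketch of that rule of thumb (hypothetical class and method names, and it only models the behaviour described above, not any real browser code): pull the charset out of the page's Content-Type and use it to URL-encode the form field. Falling back to ISO-8859-1 when no charset is given is an assumption of this sketch. Needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class FormCharsetSketch {

    // Extract the charset parameter from a Content-Type value such as
    // "text/html; charset=utf-8". Assumption: ISO-8859-1 when none is present.
    static Charset charsetOf(String contentType) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Tolerate a quoted value, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                return Charset.forName(name);
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    // Encode one form field as name=value using the charset of the page
    // that carried the form.
    static String encodeField(String name, String value, Charset pageCharset) {
        return URLEncoder.encode(name, pageCharset) + "="
                + URLEncoder.encode(value, pageCharset);
    }

    public static void main(String[] args) {
        Charset page = charsetOf("text/html; charset=utf-8");
        // "caf\u00e9" is "cafe" with an acute accent on the final e.
        System.out.println(encodeField("name", "caf\u00e9", page)); // name=caf%C3%A9
        System.out.println(
                encodeField("name", "caf\u00e9", StandardCharsets.ISO_8859_1)); // name=caf%E9
    }
}
```

The same field ends up as different bytes on the wire depending on the page's charset, so the receiving side has to decode with that same charset.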

That's normally a rule of thumb, and that's why (IMVHO) UTF-8 should be  
used for all forms, and should always be used as the default encoding  
for reading and writing.

> - I assume the cocoon servlet could easily arrange for calling the
> method before anything else

Yes, hoping that it actually works. But cocoon should call the method  
with the encoding used to send the form from where data is read...  
should be easy for continuations, but in most of the cases, I'd say  
that it's a good principle to choose one encoding for your entire  
application and stick to it...

> - I'm a bit unsure here if the javadoc mentioning of 'in the body of
> this request' is going to be interpreted by implementations as a
> limiting scope, and if so if they include the URI (and the request
> params using get vs post) as part of it or not

The point you mentioned in the spec _DOES_NOT_ include the request URI.  
We've talked quite extensively over it while writing Servlet 2.4, which  
(in theory) should expand more on the concepts of charset and i18n.

> (talk about possible confusion when writing specs like this, yuk!)

Well, it's a big gray area... Most of my knowledge is based on my  
girlfriend's PC. She's Japanese, and although I don't understand what's  
all that gibberish on her screen, I can still test out a few bits and  
bobs...

For all our MacOS/X folks, if you want to try out playing with  
different encodings and internationalization settings, close your  
Safari, Mozilla, Firefox, and so on, go into the System Preferences and  
drag the three "bookcase, christmas tree, lotsa-lines block"  
(ni-hon-go) sequence of three characters right up to the top. Start  
your browser, and then restore English (French, Italian, German) up on  
top where it was in the preferences.

Your browser will now think it's working on a Japanese PC and will do  
everything like you were living in Tokyo.

On Windows, sorry, your best bet is to actually GO to Tokyo, and buy a  
copy of WindowsXP in Japanese. :-(

	Pier
