cocoon-dev mailing list archives

From Marc Portier <...@outerthought.org>
Subject Re: [Help]How can I use non-ascii file name?
Date Thu, 19 Aug 2004 08:15:47 GMT


Pier Fumagalli wrote:

> On 17 Aug 2004, at 16:20, Marc Portier wrote:
> 
>>> How about setting it up as the default behavior for Cocoon's 
>>> internal  Jetty distro?
>>
>>
>> makes sense, but: (wishing all this brokenness wasn't there, but alas)
> 
> 
> It's not really "brokenness" but more along the lines of an inversion  
> of the Robustness Principle, as outlined by J. Postel in RFC-791  
> (http://www.rfc-editor.org/rfc/rfc791.txt section 3.2) and later  
> dogmatized by R. Braden in RFC-1122  
> (http://www.rfc-editor.org/rfc/rfc1122.txt Section 1.2.2).
> 
> "Be liberal in what you accept, and conservative in what you send."
> 
> In this case browsers are liberal in what they send (URL-Encoded UTF-8)  
> and servlet containers are conservative in what they accept  
> (URL-Encoded ISO-8859-1).
> 

indeed
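
To make the mismatch concrete, here is a minimal sketch using only the JDK's java.net.URLDecoder. The class and method names are made up for illustration; this is not how any particular container decodes URLs, just the same bytes read under the two charsets mentioned above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical helper: decodes one percent-encoded value under a chosen charset. */
final class PercentDecodeDemo {

    /** Returns the decoded text, interpreting the escaped bytes in charsetName. */
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
    }
}
```

Calling PercentDecodeDemo.decode("%E2%82%AC", "UTF-8") gives back the single euro-sign character, while the same input with "ISO-8859-1" gives three unrelated characters (one per byte), which is the garbage a file lookup would then be attempted with.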

>> - it shouldn't keep us from actually solving it for all
>> containers? (my guess is that just a fraction of cocoon deployments
>> actually run on the internal jetty distro, i.e. using the cocoon.sh or
>> .bat?)
> 
> 
> Well, we found that Jetty in production was much better than anyone  
> else. So, in our production environment we have Jetty (not the Cocoon  
> distro one, a full blown copy)... Works pretty neatly! :-P
> 
>> - learning about this org.mortbay.util.URI.charset property we should
>> probably use it to override (or at least log-warn deployers if it's
>> different to) the container-encoding setting in the web.xml
>> (assuming that the mentioned property will also be in effect when
>> decoding the request parameters, and taking in account that current
>> cocoon code assumes ISO-8859-1 as the default there)
> 
> 
> I agree, but as I said, my world revolves around the best container in  
> the world (whoops, Jetty), so I already have "my" fix to the problem:  
> switch! :-P
> 
>> - once we've run that far, we might even consider making a scan of  other
>> servlet containers and how they possibly allow setting the
>> container-encoding?
> 
> 
> The "container-encoding" servlet initialization parameter simply  
> applies to request parameters (form data), and I suppose it only  
> affects the way in which we read full-blown characters from  
> ServletRequest.getInputStream() and parse forms.
> 

I'd need to check, but I assume the request params are included 
regardless of the GET or POST method

of course the uri-part before the ? would need to have been used already 
internally in the servlet container, at least to point to the correct JSP 
or servlet...

hm, I'd need to try out some jsp/servlet with a euro-sign in the 
file-name or so and check whether the path indication in the web.xml is 
able to find it...

>> while typing I started rethinking why we ended up with this
>> container-encoding init-param in web.xml?
>>
>> IIRC we did that because of required compliance to servlet spec  versions
>> prior to 2.3?  So first question is are we still on servlet 2.2?
>>

Just found the thread that answers the question:
http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=108858029423811&w=2

>> If not: Since 2.3 there exists a setCharacterEncoding()
>> <quote from="servlet 2.3 javadoc"
>> href="http://java.sun.com/products/servlet/2.3/javadoc/javax/servlet/ServletRequest.html#setCharacterEncoding(java.lang.String)">
>>   Overrides the name of the character encoding used in the body of this
>>   request. This method must be called prior to reading request
>>   parameters or reading input using getReader().
>> </quote>
> 
> 
> Indeed, the problem here is that it's nowhere specified how the request  
> BODY (not the URL, source of this problem) should be encoded.
> 

yep, but as stated above: I suppose that the border-case 'request-params 
in GET mode' is included (even if those are -strictly speaking- not in 
the body?).

This seems to suggest that the current use of the en-re-decoding trick 
in cocoon's request-wrapper could be cleaned out (since we voted to go 
with 2.3 from now on)
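
For readers who haven't seen it, the en-re-decoding trick boils down to the following. This is a minimal sketch with a hypothetical class name, not the actual code in cocoon's request-wrapper; it assumes the container decoded the value as ISO-8859-1 while the browser really sent UTF-8.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of the re-decoding trick; not Cocoon's own class. */
final class RecodeDemo {

    /**
     * Undoes a wrong ISO-8859-1 decode: ISO-8859-1 maps every byte to exactly
     * one character, so encoding back recovers the original bytes, which are
     * then decoded again as UTF-8.
     */
    static String recode(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

The trick only works because ISO-8859-1 is lossless in that direction; if the container had decoded with some other charset, the original bytes might no longer be recoverable.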

> Normally, from browser behaviour, I can see that usually browsers tend  
> to post application/x-www-form-urlencoded in the same charset they used  
> interpreting the form. So given an HTTP request like this:
> 
> C: GET /myForm HTTP/1.1
> C: Host: localhost:80
> C:
> S: HTTP/1.1 200 OK
> S: Date: Wed, 18 Aug 2004 08:30:28 GMT
> S: Server: Apache/2.0.49 (Unix) DAV/2 SVN/1.0.2
> S: Content-Type: text/html; charset=utf-8
> 
> When the form included in /myForm is posted back to its action, the  
> UTF-8 charset will be used to encode the form data...
> 
> That's normally a rule of thumb, and that's why (IMVHO) UTF-8 should be  
> used for all forms, and should always be used as the default encoding  
> for writing and reading.
> 

yep,
we have wiki info already indicating that to our users:
http://wiki.apache.org/cocoon/RequestParameterEncoding

(hm, more interesting stuff out there, and probably some of the new 
viewpoints from this thread could be added there)
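
As a small illustration of the rule of thumb quoted above (form data goes out in the charset of the page that carried the form), the sketch below encodes one field value the way a browser would under two page charsets. The class and method names are hypothetical; only java.net.URLEncoder is real.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper mimicking how a browser would encode one form field. */
final class FormEncodeDemo {

    /** Percent-encodes value using the charset the page was served in. */
    static String encodeField(String value, String pageCharset) {
        try {
            return URLEncoder.encode(value, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + pageCharset, e);
        }
    }
}
```

For the word "caf" followed by an e-acute, encodeField(..., "UTF-8") yields caf%C3%A9 and encodeField(..., "ISO-8859-1") yields caf%E9, so the receiving side has to know which page charset was in effect to decode either one correctly.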


>> - I assume the cocoon servlet could easily arrange for calling the
>> method before anything else
> 
> 
> Yes, hoping that it actually works. But cocoon should call the method  
> with the encoding used to send the form from where data is read...  

yep, they should be consistent.
fact is there was a patch on the serializers to do so by default

(but the other way around: by default they are taking the setting of 
form_encoding init param for doing the serialization)

fix commit here:
http://cvs.apache.org/viewcvs.cgi/cocoon/trunk/src/java/org/apache/cocoon/serialization/AbstractTextSerializer.java?r1=24666&r2=26246&p1=cocoon/trunk/src/java/org/apache/cocoon/serialization/AbstractTextSerializer.java&p2=cocoon/trunk/src/java/org/apache/cocoon/serialization/AbstractTextSerializer.java&diff_format=h&root=Apache-SVN

archived discussion here: 
http://marc.theaimsgroup.com/?t=106760662600010&r=1&w=2

> should be easy for continuations, but in most of the cases, I'd say  
> that it's a good principle to choose one encoding for your entire  
> application and stick to it...
> 

agree, just running through the (above mentioned) wiki page however I 
noticed some paragraph on wanting to 'locally' override the 
form-encoding for certain pipelines (use case being support for 
clients other than the classic browsers, which might behave 
differently)

the suggested setCharacterEncodingAction seems to be a good match to 
that issue and it somewhat suggests we should keep some form of possible 
en-re-decoding scheme in our request-wrapper (looks like the 2.3 switch 
should not make us jump to hasty conclusions on that part)

(boy this issue seems to be a rose with many thorns, and it seems to 
blossom every year or so :-))

>> - I'm a bit unsure here if the javadoc mentioning of 'in the body of
>> this request' is going to be interpreted by implementations as a
>> limiting scope, and if so if they include the URI (and the request
>> params using get vs post) as part of it or not
> 
> 
> The point you mentioned in the spec _DOES_NOT_ include the request URI.  
> We've talked quite extensively over it while writing Servlet 2.4, which  
> (in theory) should expand more on the concepts of charset and i18n.
> 

thx for the clarification and inside info

>> (talk about possible confusion when writing specs like this, yuk!)
> 
> 
> Well, it's a big gray area... Most of my knowledge is based on my  
> girlfriend's PC. She's japanese, and although I don't understand what's  
> all that gibberish on her screen, I can still test out a few bits and  
> bobs...
> 
> For all our MacOS/X folks, if you want to try out playing with  
> different encodings and internationalization settings, close your  
> Safari, Mozilla, Firefox, and so on, go into the System Preferences and  
> drag the three "bookcase, christmas tree, lotsa-lines block"  
> (ni-hon-go) sequence of three characters right up to the top. Start  
> your browser, and then restore english (french, italian, german) up on  
> top where it was in the preferences.
> 
> Your browser will now think it's working on a Japanese PC and will do  
> everything like you were living in Tokyo.
> 
> On Windows, sorry, your best bet is to actually GO to Tokyo, and buy a  
> copy of WindowsXP in Japanese. :-(
> 

yeah testing isn't obvious as one also needs to rely on having an 
as-unicode-complete-as-they-come font so you are sure you are seeing 
what you think you are seeing...

any case: my personal testing-candidate for these cases is just using 
the euro-sign (\u20AC, utf-8: %E2%82%AC) in pathnames, filenames, 
classnames, request params and whatnot.

most european systems (even windows) would have a native encoding 
supporting the eurosign (while iso-8859-1 obviously doesn't)
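
A quick way to check both claims from Java, as a sketch with hypothetical names: ask a charset's encoder whether it can represent the euro sign, and print the UTF-8 bytes in percent-escaped form.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical probe for the euro sign as a test character. */
final class EuroProbe {

    static final String EURO = "\u20AC";

    /** True if the given charset has a byte sequence for the euro sign. */
    static boolean representable(Charset charset) {
        return charset.newEncoder().canEncode(EURO);
    }

    /** The euro sign's UTF-8 bytes, written the way they appear in a URL. */
    static String utf8PercentEncoded() {
        StringBuilder out = new StringBuilder();
        for (byte b : EURO.getBytes(StandardCharsets.UTF_8)) {
            out.append(String.format("%%%02X", b & 0xFF));
        }
        return out.toString();
    }
}
```

EuroProbe.representable(StandardCharsets.ISO_8859_1) is false and EuroProbe.representable(StandardCharsets.UTF_8) is true, while utf8PercentEncoded() returns %E2%82%AC, matching the escape given above.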

geek detail: you can even use it in your Java source code:

public class \u20ACToBEF
{
...
}

(in fact java's compiler is completely unicode aware towards the source 
code: if you're sick enough you might even go about writing the keywords 
like 'public' and 'class' in their escaped unicode variants :-)
notice that you will need to be able to specify a euro sign in the 
filename of that source though)
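
A compilable variant of that idea, as a sketch: the body and the conversion method are my own filling-in for illustration (1 EUR = 40.3399 BEF is the fixed official rate). The class is deliberately not public, since only a public class has to live in a file named after it, which sidesteps the filename issue.

```java
// The first keyword below is "final" with its leading f written as a
// unicode escape; the compiler translates escapes before it looks for
// keywords, so this is an ordinary final class.
\u0066inal class \u20ACToBEF {

    /** Fixed conversion rate: Belgian francs per euro. */
    static final double BEF_PER_EURO = 40.3399;

    /** Converts an amount in euros to Belgian francs. */
    static double convert(double euros) {
        return euros * BEF_PER_EURO;
    }
}
```

Other code refers to it the same way, e.g. \u20ACToBEF.convert(2.0), or with a literal euro sign if the source file's encoding and the compiler's -encoding setting agree.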


regards,
-marc
-- 
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                http://blogs.cocoondev.org/mpo/
mpo@outerthought.org                              mpo@apache.org
