cocoon-dev mailing list archives

From Pier Fumagalli <p...@betaversion.org>
Subject Re: [Help]How can I use non-ascii file name?
Date Mon, 16 Aug 2004 13:02:45 GMT
Ok, I tracked the sucker down... It's the servlet container... They all  
decode the stupid URL using ISO-8859-1... And therefore, utterly  
incompatible with 3/4 of the non-english-speaking world...

At best, I was able to _HACK_ the whole thing through, by getting the  
path info in this way:

<WARNING note="shit-code-follows">

new String(request.getPathInfo().getBytes("ISO-8859-1"), "UTF-8");

</WARNING>

Therefore, I get the BYTES of the path-info string as if they were in  
ISO-8859-1, and re-create a new string by taking those bytes and  
forcing them to be in UTF-8...
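
To make the round trip concrete, here is a minimal standalone sketch. It does not talk to a real container: the class name, the recode() helper and the hard-coded path are my own, and the "container" step just simulates the ISO-8859-1 decoding I observed.

```java
import java.nio.charset.StandardCharsets;

class PathInfoRecode {
    // Take a string that was wrongly decoded as ISO-8859-1,
    // recover the original bytes, and read them again as UTF-8.
    static String recode(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "/" followed by U+8C37 U+7406 U+5B50, written with escapes so the
        // result does not depend on the source file encoding.
        String intended = "/\u8c37\u7406\u5b50";

        // Bytes on the wire after percent-decoding (UTF-8, as the browser sent them).
        byte[] wire = intended.getBytes(StandardCharsets.UTF_8);

        // Simulated container behaviour (assumption): bytes read as ISO-8859-1.
        String pathInfo = new String(wire, StandardCharsets.ISO_8859_1);

        System.out.println(pathInfo.equals(intended));         // mangled path
        System.out.println(recode(pathInfo).equals(intended)); // repaired path
    }
}
```

This only works because ISO-8859-1 maps every byte value to exactly one character, so no information is lost in the wrong decoding step. With a container that decodes using some other charset, the same trick may destroy bytes.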

Niiiiiiiiiiiiiiiiiiice!

Note that this stupidity also happens with accented letters (that for  
us Italians is a big p-i-t-a).

I'll see why this happens in Jetty, I'll poke Jen and Greg to have  
either a fix, or an explanation and workaround... For now, brrrr, I  
think that the hack is the only way to go...

Oh, I checked it also on Tomcat. Same problem there as well...

	Pier



On 16 Aug 2004, at 12:05, Marc Portier wrote:

> Pier,
>
>
> As a coincidence we recently (last week) had a similar post on  
> xreporter-list (which uses cocoon)
>
> Bad news is that I didn't track it down to the bottom yet, just some  
> findings below:
> (in fact the odd-char-in-filename for map:read and map:mount was one  
> of the first things I was going to test, seems I'm already presented  
> with the results)
>
>
> what I did find already was this:
>
> Cocoon's Request.getSitemapURI() will return an assembly of  
> javax.servlet.http.HttpServletRequest.getServletPath()
> + javax.servlet.http.HttpServletRequest.getPathInfo()
>
> Servlet spec on those states they will be (url-) decoded
> Thus 3 char sequences of the kind "%BYTE_HEX" will have been  
> translated into single bytes. The obtained byte-sequence is then  
> decoded using SOME_DECODING (my guess would be using ISO-8859-1, but  
> haven't found yet if this is container specific, modifiable or hard  
> noted in some spec. Only thing I found is this:  
> http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars, but  
> I'm yet unsure on how this influences servlet specs, or actual  
> container and even browser implementations for that matter)
>
>
> Alternatively there is:
> Cocoon's Request.getRequestURI() which maps onto the
> javax.servlet.http.HttpServletRequest.getRequestURI()
>
> This one resembles the URI as transferred over the wire: ie. not  
> (url-)decoded, or in other words still holding the %XX sequences
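
To illustrate the difference between the two, here is a rough sketch that decodes the raw %XX form by hand with an explicit charset. The class and method names are made up for the example. It uses java.net.URLDecoder with a Charset argument (available from Java 10 on), which is really meant for form data and also turns '+' into a space, so take it as an illustration and not as a proper path decoder.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RawUriDecode {
    // Percent-decode a raw request URI, interpreting the bytes in the given charset.
    static String decode(String rawUri, Charset charset) {
        return URLDecoder.decode(rawUri, charset);
    }

    public static void main(String[] args) {
        // The raw URI exactly as it shows up in the access log.
        String raw = "/%e8%b0%b7%e7%90%86%e5%ad%90";

        // Nine bytes read as UTF-8 give three characters after the slash.
        String asUtf8 = decode(raw, StandardCharsets.UTF_8);
        System.out.println(asUtf8.equals("/\u8c37\u7406\u5b50"));

        // The same nine bytes read as ISO-8859-1 give nine unrelated characters.
        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);
        System.out.println(asLatin1.length());
    }
}
```
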
>
>
> As an extra clarification on all these the servlet spec explicitly  
> states: (2.3 version, page 34, section SRV4.4 Request Path Elements)
> <quote>
> It is important to note that, *except for URL encoding differences*  
> between the request URI and the path parts, the following equation is  
> always true:
>
> requestURI = contextPath + servletPath + pathInfo
> </quote>
>
>
> I (for now) assume that this is the same encoding we expect  
> cocoon-deploy people to specify in the 'container-encoding'  
> init-parameter in the web.xml (allowing to correctly en-re-decode  
> request-parameter-values in case of mismatching form and container  
> encodings)
>
>
>
>
> Ok, above is dull data, and not much into a direction of any solution  
> yet.  My current feeling (long shot, needs time to test and try, and  
> based on above assumption) is that we should
>
> In terms of backwards compatibility I'm unsure if we could just go  
> about changing the semantics (historically implied use of iso-8859-1  
> encoding) of getSitemapURI() or rather should deprecate and/or have a  
> different method next to it?
>
> In any case this new implementation should then probably apply the  
> same kind of dirty en-re-decoding-trick
>
> return new String(getSitemapURI().getBytes(container_encoding), form_encoding)
>
> as we do today with the request param values?
>
> (see  
> http://cvs.apache.org/viewcvs.cgi/cocoon-2.1/src/java/org/apache/ 
> cocoon/environment/http/HttpRequest.java?annotate=1.11#391
> sorry for the old cvs-style link, the svn version of viewcvs doesn't  
> seem to support 'annotate' ?)
>
>
> For the record: the fast hack/workaround in the xreporter case was  
> exactly to apply this.
>
>
>
>
> Attached to this I'm also seeing the trouble of mount-points in  
> cocoon.   I've seen a number of installments needing (well, 'using' at  
> least) some insertion of that  
> part-of-the-URL-that-maps-to-the-mounted-sitemap to be able to have  
> links in source xml.files refer to other resources managed by the same  
> mounted sitemap without the need to explicitly mention that part (but  
> have it dynamically inserted by some xsl instead).
>
> In those occasions I've seen people mostly subtract siteMapURI from  
> requestURI to obtain that prefix part. Regarding the above  
> observations this algorithm will however fail due to encoding  
> differences.
>
> My proposal would be to not only add a method for decoding the  
> sitemapURI properly, but in the mean time adding the convenience  
> method to return the mounted-sitemap-part as well on the level of  
> cocoon's request.
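
A possible shape for such a convenience method, purely as a sketch: mountPrefix() is a hypothetical helper and not part of Cocoon's Request API. It assumes the sitemap URI has already been decoded correctly and that the raw request URI is percent-encoded UTF-8. It again borrows URLDecoder with a Charset argument (Java 10 and later) for the decoding step.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class MountPrefix {
    // Hypothetical helper: returns the part of the request URI that precedes
    // the sitemap URI, or null when the sitemap URI is not a suffix of it.
    // Both sides are compared in decoded form so that %XX sequences in the
    // raw request URI do not break the comparison.
    static String mountPrefix(String rawRequestUri, String decodedSitemapUri) {
        String decoded = URLDecoder.decode(rawRequestUri, StandardCharsets.UTF_8);
        if (!decoded.endsWith(decodedSitemapUri)) {
            return null;
        }
        return decoded.substring(0, decoded.length() - decodedSitemapUri.length());
    }

    public static void main(String[] args) {
        String rawRequestUri = "/app/mount/%e8%b0%b7%e7%90%86%e5%ad%90";
        String sitemapUri = "\u8c37\u7406\u5b50";

        // Naive subtraction on the raw URI fails: the suffix never matches.
        System.out.println(rawRequestUri.endsWith(sitemapUri));

        // Comparing decoded forms recovers the mount prefix.
        System.out.println(mountPrefix(rawRequestUri, sitemapUri));
    }
}
```

Note that the prefix comes back in decoded form, so it would need to be encoded again before being written into links.
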
>
>
>
> Above are early observations that need some backing, so comments  
> welcome. (and hoping someone beats me to this since I'm lacking the  
> time to pursue myself)
> -marc=
>
>
> Pier Fumagalli wrote:
>> On 12 Aug 2004, at 12:45, roy huang wrote:
>>> Hi,all:
>>>     Use reader to display jpg or gif is quite simple,like:
>>>    <map:match pattern="*.jpg">
>>>     <map:read mime-type="image/jpg" src="jpg/{1}.jpg" />
>>>    </map:match>
>>>    But if the file name is not ASCII but utf-8 or other encoding  
>>> like 花.jpg (simplified Chinese),the resolver didn't resolve the name  
>>> correctly,error occur:
>>> org.apache.cocoon.ResourceNotFoundException: Error during resolving  
>>> of the input stream:  
>>> org.apache.excalibur.source.SourceNotFoundException: file:/C:/My  
>>> Documents/IBM/wsad/workspace/PowerOA/WebContent/test/jpg/花.jpg  
>>> doesn't exist.
>>>
>>> How can I use non-ASCII file name in cocoon?I can't find any  
>>> description or help in wiki or archived mail list.
>>>
>>> Roy Huang
>> It appears indeed as a bug...
>> I have this sitemap snippet:
>>     <map:match pattern="谷*">
>>       <map:generate src="谷{1}.xml"/>
>>       <map:transform src="welcome.xslt">
>>         <map:parameter name="contextPath"  
>> value="{request:contextPath}"/>
>>       </map:transform>
>>       <map:serialize type="xhtml"/>
>>     </map:match>
>> and a file on the disk called "谷理子.xml". Somewhere, when I make a  
>> request for "http://localhost:8888/谷理子", the whole thing goes  
>> berserk...
>> Now, the URL is passed correctly, as I see that in the access log:
>> INFO    (2004-08-16) 10:26.36:538   [access]  
>> (/%e8%b0%b7%e7%90%86%e5%ad%90) main-3/CocoonServlet: '????????'  
>> Processed by Apache Cocoon 2.1.5 in 27 milliseconds.
>> The above-mentioned string's encoding in UTF-8 is, in fact, "E8 B0 B7  
>> E7 90 86 E5 AD 90", so, cocoon receives it correctly, but somehow it  
>> gets lost in the process.
>> Now, if I modify my sitemap to
>>     <map:match pattern="tanisatoko">
>>       <map:generate src="谷理子.xml"/>
>>       <map:transform src="welcome.xslt">
>>         <map:parameter name="contextPath"  
>> value="{request:contextPath}"/>
>>       </map:transform>
>>       <map:serialize type="xhtml"/>
>>     </map:match>
>> And I make a request to "http://localhost:8888/tanisatoko", the thing  
>> works perfectly. We can safely exclude the fact that it's the  
>> generation process.
>> Now, the _odd_ thing I noticed is that in those cases, I get an error  
>> of "PipelineNotFound", not a "ResourceNotFound", which means that the  
>> matcher seriously doesn't see that request.
>> Changing over the matcher to a 'regexp' matcher doesn't change, so, I  
>> bet it's the data we feed to the matcher.
>> Now, changing that matcher to  
>> "&#xe8;&#xb0;&#xb7;&#xe7;&#x90;&#x86;&#xe5;&#xad;&#x90;", the  
>> encoding, and running it again, I get my nice page correctly.
>> I bet that somewhere (I don't know where, but surely somewhere), the  
>> UTF-8 encoded URL is converted into a string using the current locale  
>> (MacRoman on my system), or a default of "ISO-8859-1", before the  
>> string is actually given to the sitemap.
>> Not having the sources at hand at the moment, I can't do a quick  
>> build to put out some debugging instruction, but  you get the idea.
>>     Pier
>
> -- 
> Marc Portier                            http://outerthought.org/
> Outerthought - Open Source, Java & XML Competence Support Center
> Read my weblog at                http://blogs.cocoondev.org/mpo/
> mpo@outerthought.org                              mpo@apache.org
>
