cocoon-dev mailing list archives

From Marc Portier <...@outerthought.org>
Subject Re: [Help]How can I use non-ascii file name?
Date Mon, 16 Aug 2004 11:05:57 GMT
Pier,


As a coincidence we recently (last week) had a similar post on 
xreporter-list (which uses cocoon)

Bad news is that I didn't track it down to the bottom yet, just some 
findings below:
(in fact the odd-char-in-filename for map:read and map:mount was one of 
the first things I was going to test, seems I'm already presented with 
the results)


what I did find already was this:

Cocoon's Request.getSitemapURI() will return an assembly of 
javax.servlet.http.HttpServletRequest.getServletPath()
+ javax.servlet.http.HttpServletRequest.getPathInfo()

The servlet spec states that both of these will be (URL-)decoded.
Thus 3-char sequences of the kind "%BYTE_HEX" will have been translated 
into single bytes. The obtained byte sequence is then decoded using 
SOME_DECODING (my guess would be ISO-8859-1, but I haven't found yet 
whether this is container-specific, modifiable or mandated by some spec. 
The only thing I found is this: 
http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars, but I'm 
still unsure how this influences the servlet specs, or actual container 
and even browser implementations for that matter).
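
To make the effect of SOME_DECODING concrete, here is a minimal sketch 
that mimics the two steps (percent-decoding, then interpreting the bytes 
in a charset) with java.net.URLDecoder. It is not the container's actual 
code path; the class and method names are made up for illustration, and 
the sample path is the one from Pier's access log below.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PathDecodingDemo {

    /**
     * Percent-decodes rawPath and interprets the resulting bytes in the
     * given charset. Note: URLDecoder is meant for form data and also turns
     * '+' into a space; the sample below contains no '+'.
     */
    public static String decode(String rawPath, String charset) {
        try {
            return URLDecoder.decode(rawPath, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String raw = "/%e8%b0%b7%e7%90%86%e5%ad%90";

        String asUtf8 = decode(raw, "UTF-8");        // 9 bytes become 3 characters
        String asLatin1 = decode(raw, "ISO-8859-1"); // 9 bytes become 9 characters

        System.out.println(asUtf8.length());   // 4  (slash + 3 characters)
        System.out.println(asLatin1.length()); // 10 (slash + 9 characters)
        System.out.println(asUtf8.equals(asLatin1)); // false
    }
}
```

If the container decodes with ISO-8859-1 while the browser sent UTF-8, 
a matcher pattern written with the real characters is compared against 
the 9-character string and cannot match.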


Alternatively there is:
Cocoon's Request.getRequestURI() which maps onto the
javax.servlet.http.HttpServletRequest.getRequestURI()

This one resembles the URI as transferred over the wire: ie. not 
(url-)decoded, or in other words still holding the %XX sequences


As an extra clarification on all of this, the servlet spec explicitly 
states (version 2.3, page 34, section SRV.4.4 Request Path Elements):
<quote>
It is important to note that, *except for URL encoding differences* 
between the request URI and the path parts, the following equation is 
always true:

requestURI = contextPath + servletPath + pathInfo
</quote>


I (for now) assume that SOME_DECODING is the same encoding we expect 
people deploying cocoon to specify in the 'container-encoding' 
init-parameter in the web.xml (which allows us to correctly re-decode 
request parameter values when the form and container encodings 
don't match).




Ok, above is dull data, and not much into a direction of any solution 
yet.  My current feeling (long shot, needs time to test and try, and 
based on above assumption) is that we should

In terms of backwards compatibility I'm unsure whether we could just go 
about changing the semantics (historically implied use of ISO-8859-1 
encoding) of getSitemapURI(), or rather should deprecate it and/or have 
a different method next to it?

In any case this new implementation should then probably apply the same 
kind of dirty en-re-decoding-trick

return new String(getSitemapURI().getBytes(container_encoding), form_encoding);

as we do today with the request param values?

(see 
http://cvs.apache.org/viewcvs.cgi/cocoon-2.1/src/java/org/apache/cocoon/environment/http/HttpRequest.java?annotate=1.11#391
sorry for the old cvs-style link, the svn version of viewcvs doesn't 
seem to support 'annotate' ?)
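
As a self-contained illustration of that en-re-decoding trick, here is a 
minimal sketch. The class and method names are hypothetical (this is not 
the HttpRequest.java code), and it assumes the container encoding is 
ISO-8859-1, which maps every byte to exactly one character and so loses 
nothing on the way back to bytes.

```java
import java.nio.charset.Charset;

public class ReDecodeSketch {

    /**
     * Turns a string that was decoded with the wrong charset back into the
     * original bytes, then decodes those bytes with the intended charset.
     * Only lossless if containerEncoding is a one-byte-per-char mapping
     * such as ISO-8859-1.
     */
    public static String reDecode(String value, String containerEncoding, String formEncoding) {
        byte[] originalBytes = value.getBytes(Charset.forName(containerEncoding));
        return new String(originalBytes, Charset.forName(formEncoding));
    }

    public static void main(String[] args) {
        String intended = "\u8c37\u7406\u5b50"; // the three characters of the file name

        // Simulate what the container hands over: UTF-8 bytes read as ISO-8859-1.
        String garbled = new String(intended.getBytes(Charset.forName("UTF-8")),
                                    Charset.forName("ISO-8859-1"));

        String repaired = reDecode(garbled, "ISO-8859-1", "UTF-8");
        System.out.println(garbled.length());          // 9
        System.out.println(repaired.equals(intended)); // true
    }
}
```

With a container encoding that cannot represent every byte value (plain 
US-ASCII, say), the getBytes() step would already lose information and 
the trick would not work.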


For the record: the fast hack/workaround in the xreporter case was 
exactly to apply this.




Related to this I'm also seeing trouble with mount-points in cocoon. 
I've seen a number of installations needing (well, 'using' at least) 
some insertion of that part-of-the-URL-that-maps-to-the-mounted-sitemap 
so that links in source xml files can refer to other resources 
managed by the same mounted sitemap without having to explicitly 
mention that part (having it dynamically inserted by some xsl instead).

On those occasions I've seen people mostly subtract the sitemapURI from 
the requestURI to obtain that prefix part. Given the above observations, 
this algorithm will however fail due to the encoding differences.

My proposal would be to not only add a method for decoding the 
sitemapURI properly, but at the same time add a convenience method 
on cocoon's request that returns the mounted-sitemap part as well.
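
A rough sketch of what such a convenience method could compute, to show 
why the naive subtraction fails and what the decoded variant would look 
like. Everything here is hypothetical: the class, the method name and the 
sample URIs are made up, and it assumes the sitemapURI passed in has 
already been decoded with the same encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class MountPrefixSketch {

    /**
     * Decodes the raw request URI and strips the (already decoded) sitemap
     * URI from its end, leaving the prefix that leads to the mounted sitemap.
     */
    public static String mountPrefix(String requestUri, String decodedSitemapUri, String encoding) {
        String decodedRequestUri;
        try {
            // URLDecoder targets form data and would turn '+' into a space,
            // so protect literal '+' characters before decoding the path.
            decodedRequestUri = URLDecoder.decode(requestUri.replace("+", "%2B"), encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
        if (!decodedRequestUri.endsWith(decodedSitemapUri)) {
            throw new IllegalArgumentException("sitemap URI is not a suffix of the request URI");
        }
        return decodedRequestUri.substring(0, decodedRequestUri.length() - decodedSitemapUri.length());
    }

    public static void main(String[] args) {
        String requestUri = "/cocoon/reports/%e8%b0%b7%e7%90%86%e5%ad%90"; // as sent over the wire
        String sitemapUri = "\u8c37\u7406\u5b50";                          // correctly decoded

        // Naive subtraction: the raw URI never ends with the decoded part.
        System.out.println(requestUri.endsWith(sitemapUri)); // false

        System.out.println(mountPrefix(requestUri, sitemapUri, "UTF-8")); // /cocoon/reports/
    }
}
```

Note that the returned prefix still includes the context path; whether a 
real method should strip that too is an open design question.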



Above are early observations that need some backing, so comments 
welcome. (and hoping someone beats me to this since I'm lacking the time 
to pursue myself)
-marc=


Pier Fumagalli wrote:
> On 12 Aug 2004, at 12:45, roy huang wrote:
> 
>> Hi,all:
>>     Use reader to display jpg or gif is quite simple,like:
>>    <map:match pattern="*.jpg">
>>     <map:read mime-type="image/jpg" src="jpg/{1}.jpg" />
>>    </map:match>
>>    But if the file name is not ASCII but utf-8 or other encoding like 
>> 花.jpg (simplified Chinese),the resolver didn't resolve the name 
>> correctly,error occur:
>> org.apache.cocoon.ResourceNotFoundException: Error during resolving of 
>> the input stream: org.apache.excalibur.source.SourceNotFoundException: 
>> file:/C:/My 
>> Documents/IBM/wsad/workspace/PowerOA/WebContent/test/jpg/花.jpg 
>> doesn't exist.
>>
>> How can I use non-ASCII file name in cocoon?I can't find any 
>> description or help in wiki or archived mail list.
>>
>> Roy Huang
> 
> 
> It appears indeed as a bug...
> 
> I have this sitemap snippet:
> 
>     <map:match pattern="谷*">
>       <map:generate src="谷{1}.xml"/>
>       <map:transform src="welcome.xslt">
>         <map:parameter name="contextPath" value="{request:contextPath}"/>
>       </map:transform>
>       <map:serialize type="xhtml"/>
>     </map:match>
> 
> and a file on the disk called "谷理子.xml". Somewhere, when I make a 
> request for "http://localhost:8888/谷理子", the whole thing goes berserk...
> 
> Now, the URL is passed correctly, as I see that in the access log:
> 
> INFO    (2004-08-16) 10:26.36:538   [access] 
> (/%e8%b0%b7%e7%90%86%e5%ad%90) main-3/CocoonServlet: '????????' 
> Processed by Apache Cocoon 2.1.5 in 27 milliseconds.
> 
> The above-mentioned string's encoding in UTF-8 is, in fact, "E8 B0 B7 E7 
> 90 86 E5 AD 90", so, cocoon receives it correctly, but somehow it gets 
> lost in the process.
> 
> Now, if I modify my sitemap to
> 
>     <map:match pattern="tanisatoko">
>       <map:generate src="谷理子.xml"/>
>       <map:transform src="welcome.xslt">
>         <map:parameter name="contextPath" value="{request:contextPath}"/>
>       </map:transform>
>       <map:serialize type="xhtml"/>
>     </map:match>
> 
> And I make a request to "http://localhost:8888/tanisatoko", the thing 
> works perfectly. We can safely exclude the fact that it's the generation 
> process.
> 
> Now, the _odd_ thing I noticed is that in those cases, I get an error of 
> "PipelineNotFound", not a "ResourceNotFound", which means that the 
> matcher seriously doesn't see that request.
> 
> Changing over the matcher to a 'regexp' matcher doesn't change, so, I 
> bet it's the data we feed to the matcher.
> 
> Now, changing that matcher to 
> "&#xe8;&#xb0;&#xb7;&#xe7;&#x90;&#x86;&#xe5;&#xad;&#x90;", the encoding, 
> and running it again, I get my nice page correctly.
> 
> I bet that somewhere (I don't know where, but surely somewhere), the 
> UTF-8 encoded URL is converted into a string using the current locale 
> (MacRoman on my system), or a default of "ISO-8859-1", before the string 
> is actually given to the sitemap.
> 
> Not having the sources at hand at the moment, I can't do a quick build 
> to put out some debugging instruction, but  you get the idea.
> 
>     Pier
> 

-- 
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                http://blogs.cocoondev.org/mpo/
mpo@outerthought.org                              mpo@apache.org
