Message-ID: <41209515.6040109@outerthought.org>
Date: Mon, 16 Aug 2004 13:05:57 +0200
From: Marc Portier
Organization: Outerthought
To: dev@cocoon.apache.org
CC: users@cocoon.apache.org
Subject: Re: [Help]How can I use non-ascii file name?

Pier,

As a coincidence, we recently (last week) had a similar post on the xreporter-list (xReporter uses Cocoon).

The bad news is that I haven't tracked it down to the bottom yet; just some findings below. (In fact, the odd-char-in-filename case for map:read and map:mount was one of the first things I was going to test; it seems I've already been presented with the results.)

What I did find already is this:

Cocoon's Request.getSitemapURI() returns an assembly of
  javax.servlet.http.HttpServletRequest.getServletPath()
  + javax.servlet.http.HttpServletRequest.getPathInfo()

The servlet spec states that both of these are (URL-)decoded. Thus 3-character sequences of the kind "%BYTE_HEX" will have been translated into single bytes. The resulting byte sequence is then decoded using SOME_DECODING. (My guess would be ISO-8859-1, but I haven't found yet whether this is container-specific, modifiable, or fixed by some spec. The only thing I found is this: http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars, but I'm as yet unsure how this influences the servlet specs, or actual container and even browser implementations for that matter.)

Alternatively there is Cocoon's Request.getRequestURI(), which maps onto javax.servlet.http.HttpServletRequest.getRequestURI(). This one resembles the URI as transferred over the wire, i.e.
not (URL-)decoded; in other words, it still holds the %XX sequences.

As an extra clarification on all of this, the servlet spec explicitly states (version 2.3, page 34, section SRV.4.4 Request Path Elements):

  It is important to note that, *except for URL encoding differences* between the request URI and the path parts, the following equation is always true:
  requestURI = contextPath + servletPath + pathInfo

I (for now) assume that this SOME_DECODING is the same encoding we expect people deploying Cocoon to specify in the 'container-encoding' init-parameter in web.xml (which allows us to correctly re-decode request-parameter values in case of mismatching form and container encodings).

OK, the above is dull data, and does not yet point in the direction of any solution.

My current feeling (long shot, needs time to test and try, and based on the above assumption) is that we should

In terms of backwards compatibility, I'm unsure whether we could just go about changing the semantics (historically implied use of ISO-8859-1 encoding) of getSitemapURI(), or rather should deprecate it and/or have a different method next to it.

In any case, this new implementation should then probably apply the same kind of dirty re-decoding trick

  return new String(getSitemapURI().getBytes(container_encoding), form_encoding)

as we do today with the request param values (see http://cvs.apache.org/viewcvs.cgi/cocoon-2.1/src/java/org/apache/cocoon/environment/http/HttpRequest.java?annotate=1.11#391; sorry for the old CVS-style link, the svn version of viewcvs doesn't seem to support 'annotate').

For the record: the fast hack/workaround in the xreporter case was exactly to apply this.

Related to this, I also see trouble with mount points in Cocoon.
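
For illustration, here is a minimal sketch of the re-decoding trick discussed above. It rests on assumptions: the container is taken to have decoded the path as ISO-8859-1 while the browser sent UTF-8, java.net.URLDecoder merely stands in for whatever the container does internally, and the class and method names (SitemapUriRedecode, containerDecode, redecode) are hypothetical, not part of Cocoon.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;

// Hypothetical illustration class; nothing here is Cocoon API.
class SitemapUriRedecode {

    // Stand-in for the container: percent-decode the raw path and turn the
    // bytes into chars using ISO-8859-1 (the ASSUMED container encoding).
    // Note: URLDecoder also maps '+' to a space, which a real container
    // would not do for path segments; the sample path contains no '+'.
    static String containerDecode(String rawPath) {
        try {
            return URLDecoder.decode(rawPath, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // The re-decoding trick: recover the original bytes using the encoding
    // the container applied, then decode them with the intended encoding.
    static String redecode(String value, String containerEncoding, String formEncoding) {
        byte[] originalBytes = value.getBytes(Charset.forName(containerEncoding));
        return new String(originalBytes, Charset.forName(formEncoding));
    }

    public static void main(String[] args) {
        // Path as it appears in the access log quoted in this thread.
        String wire = "/%e8%b0%b7%e7%90%86%e5%ad%90";

        String decoded = containerDecode(wire);
        // One char per byte: the slash plus nine mis-decoded characters.
        System.out.println(decoded.length()); // 10

        String fixed = redecode(decoded, "ISO-8859-1", "UTF-8");
        // The slash plus the three intended characters.
        System.out.println(fixed.length()); // 4
        System.out.println(fixed.equals("/\u8c37\u7406\u5b50")); // true
    }
}
```

The round trip only works because ISO-8859-1 maps every byte value to exactly one character, so no information is lost in the first, wrong decoding step. With a container encoding that rejects or replaces some byte sequences, the original bytes could not be recovered this way.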
I've seen a number of installations needing (well, 'using' at least) some insertion of that part-of-the-URL-that-maps-to-the-mounted-sitemap, so that links in source XML files can refer to other resources managed by the same mounted sitemap without the need to explicitly mention that part (having it dynamically inserted by some XSL instead).

On those occasions I've mostly seen people subtract the sitemapURI from the requestURI to obtain that prefix part. Given the above observations, however, this algorithm will fail due to encoding differences: the requestURI still holds the %XX sequences while the sitemapURI has been decoded.

My proposal would be to not only add a method for decoding the sitemapURI properly, but at the same time add a convenience method on Cocoon's request that returns the mounted-sitemap part as well.

The above are early observations that need some backing, so comments are welcome. (And I'm hoping someone beats me to this, since I'm lacking the time to pursue it myself.)

-marc=

Pier Fumagalli wrote:
> On 12 Aug 2004, at 12:45, roy huang wrote:
>
>> Hi,all:
>> Use reader to display jpg or gif is quite simple,like:
>>
>>
>>
>> But if the file name is not ASCII but utf-8 or other encoding like
>> 花.jpg (simplified Chinese),the resolver didn't resolve the name
>> correctly,error occur:
>> org.apache.cocoon.ResourceNotFoundException: Error during resolving of
>> the input stream: org.apache.excalibur.source.SourceNotFoundException:
>> file:/C:/My
>> Documents/IBM/wsad/workspace/PowerOA/WebContent/test/jpg/花.jpg
>> doesn't exist.
>>
>> How can I use non-ASCII file name in cocoon?I can't find any
>> description or help in wiki or archived mail list.
>>
>> Roy Huang
>
>
> It appears indeed as a bug...
>
> I have this sitemap snippet:
>
>
>
>
>
>
>
>
>
> and a file on the disk called "谷理子.xml". Somewhere, when I make a
> request for "http://localhost:8888/谷理子", the whole thing goes berserk...
>
> Now, the URL is passed correctly, as I see that in the access log:
>
> INFO (2004-08-16) 10:26.36:538 [access]
> (/%e8%b0%b7%e7%90%86%e5%ad%90) main-3/CocoonServlet: '????????'
> Processed by Apache Cocoon 2.1.5 in 27 milliseconds.
>
> The above-mentioned string's encoding in UTF-8 is, in fact, "E8 B0 B7 E7
> 90 86 E5 AD 90", so, cocoon receives it correctly, but somehow it gets
> lost in the process.
>
> Now, if I modify my sitemap to
>
>
>
>
>
>
>
>
>
> And I make a request to "http://localhost:8888/tanisatoko", the thing
> works perfectly. We can safely exclude the fact that it's the generation
> process.
>
> Now, the _odd_ thing I noticed is that in those cases, I get an error of
> "PipelineNotFound", not a "ResourceNotFound", which means that the
> matcher seriously doesn't see that request.
>
> Changing over the matcher to a 'regexp' matcher doesn't change, so, I
> bet it's the data we feed to the matcher.
>
> Now, changing that matcher to
> "谷理子", the encoding,
> and running it again, I get my nice page correctly.
>
> I bet that somewhere (I don't know where, but surely somewhere), the
> UTF-8 encoded URL converted into a string using the current locale
> (MacRoman on my system), or a default of "ISO-8859-1", before the string
> is actually given to the sitemap.
>
> Not having the sources at hand at the moment, I can't do a quick build
> to put out some debugging instruction, but you get the idea.
>
> Pier
>

--
Marc Portier
http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at http://blogs.cocoondev.org/mpo/
mpo@outerthought.org
mpo@apache.org