From: "Roy T. Fielding"
To: Wilfredo Sánchez Vega
Cc: Joe Orton, APR Developer List
Subject: Re: apr_filepath_encoding on Darwin
Date: Mon, 6 Aug 2007 17:11:25 -0700
In-Reply-To: <811D919C-AB44-49F6-B24C-92E7792770C6@wsanchez.net>
References: <20070717122529.GB2587@redhat.com> <20070718091148.GA23522@redhat.com> <3015D969-3590-4387-B5EA-5EBFB38ECBA3@wsanchez.net> <811D919C-AB44-49F6-B24C-92E7792770C6@wsanchez.net>
Mailing-List: contact dev-help@apr.apache.org; run by ezmlm
X-Mailer: Apple Mail (2.752.2)

On Aug 6, 2007, at 4:10 PM, Wilfredo Sánchez Vega wrote:

> (Sorry for the lame reply latency.)
>
> On Jul 18, 2007, at 5:24 PM, Roy T. Fielding wrote:
>
>> A system less concerned with backwards compatibility is better off
>> with a requirement of utf-8, though OS X should have made the filename
>> encoding a mount option.
>
> I disagree. Having one encoding is far superior to every application
> having to first find out what encoding the filesystem is question is
> using then using that.
>
> I see no value in having different mount points use different
> encodings.

Well, neither do I (now that utf-8 exists), but the fact is that
they do and they aren't necessarily controlled by the same OS.

>> I assume that the ISO9660-Joliet (CD-ROM) driver does
>> some form of filename translation automatically from UCS-2.
>
> The underlying volume format can use whatever it wants. Ideally the
> format defines what that is. Unfortunately, that's not the case, but
> for those that do, yes, converting to UTF-8 is the responsibility of
> the file system at the VFS layer.
>
> I suppose that a mount option to tell the filesystem that "this UFS
> volume uses encoding X" would be useful, but I maintain that above the
> kernel, you really want one encoding, not N. Helping the kernel know
> what's underneath is certainly useful.

I agree. But is it the case that non-native mounted filesystems
are name-translated by the kernel? I mean, if OS X did this consistently
for all mount points, then I would see it as being reasonable
for the OS X applications to reject anything else.

>> In any case, even with the convention, it is left to the application
>> to determine how it will treat encoded filenames.
>> The OS X decision
>> to treat them all as utf-8 is at least consistent. OTOH, this
>> is just a display convention -- OS X apps should have been designed
>> to treat the filename internally as an opaque nul-terminated array,
>> rather than barfing on non-utf8 encodings.
>
> This is difficult in practice. When the open panel sees a file that
> is not in UTF-8, there is no reliable way to display anything sane to
> the user. I suppose a Linux nerd might say "show me some hex" or
> something, but most of our users are not Linux nerds. I agree that
> crashing is worse than hex, though.

Actually, it also crashes on valid utf-8 in normal form, because
OS X doesn't follow the standard on normalization. See "man -s 5 utf8":

   If more than a single representation of a value exists (for example,
   0x00; 0xC0 0x80; 0xE0 0x80 0x80) the shortest representation is always
   used. Longer ones are detected as an error as they pose a potential
   security risk, and destroy the 1:1 character:octet sequence mapping.

but OS X requires the longer composition characters over shorter ones.
My guess is that choice was driven by the way the UI allows such
characters to be composed (like "alt-u u" for uumlaut).

Of course, even with these issues, the Mac still kicks ass.

> Basically, on Mac OS X, you can, in fact, use whatever characters you
> like on UFS and BSD level software tends to cope with that. But if you
> aren't using UTF-8, then you aren't writing file name that are meant
> for user consumption. ie. that may be OK for a database (eg. fsfs),
> though I think that even in that case you can reasonably stick to
> ASCII in many cases.
>
>> One thing I miss in OS X is an automated way for file archivers
>> (like unzip) to recognize and convert non-utf-8 filenames
>> when they are unarchived. I frequently have to do that by hand
>> after unzipping something from China or Switzerland.
>
> Again, same as with volume formats, if the zip file format defines
> the encoding in zip files, then this should be easy (insofar as
> encodings are easy) for the software to deal with.

Sadly, it doesn't (filenames are just null-terminated strings).
There are options for conversion from EBCDIC, but nothing to transcode
the filenames in general as they are unzipped. Maybe the zip command
maintainer will take that as an enhancement request.

>> Subversion
>> breaks on OS X whenever someone commits a filename with an e-grave,
>> which is a problem when your main product name is Communiqué.
>> I wonder if this change in APR would fix that error?
>
> You still have to hope that the inbound encoding is correct (that is,
> that svn somehow knows it). On OS X, that's easy; it's UTF-8. Once
> other operating systems come into the mix, it'll works as well as the
> encodings are defined (and known to svn) on those systems.

What I do currently is define

   setenv MM_CHARSET "utf-8"
   setenv LANG "en_US.utf-8"

in my shell init file.

....Roy
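
For readers following the composed-versus-decomposed point in this thread, here is a minimal sketch of how the same visible name can have two different UTF-8 byte sequences. It uses Python's standard unicodedata module, which is not part of the thread. The helper names utf8_forms and same_name are hypothetical and exist only for this illustration, and it is an assumption here that the OS X behaviour described above corresponds to storing a decomposed (NFD-like) form. Both byte sequences are shortest-form UTF-8 for their code points; they differ in which code points are used, not in how each one is encoded.

```python
import unicodedata


def utf8_forms(name: str) -> dict:
    """Return the UTF-8 bytes of `name` in composed (NFC) and decomposed (NFD) form.

    Hypothetical helper for illustration only.
    """
    return {
        "NFC": unicodedata.normalize("NFC", name).encode("utf-8"),
        "NFD": unicodedata.normalize("NFD", name).encode("utf-8"),
    }


def same_name(a: bytes, b: bytes) -> bool:
    """Compare two UTF-8 filenames after normalizing both to NFC.

    Hypothetical helper; assumes both inputs are valid UTF-8.
    """
    def to_nfc(raw: bytes) -> str:
        return unicodedata.normalize("NFC", raw.decode("utf-8"))

    return to_nfc(a) == to_nfc(b)


# The product name from the thread, written with a precomposed e-acute (U+00E9).
forms = utf8_forms("Communiqu\u00e9")

print(forms["NFC"])                            # composed: one code point for the accented e
print(forms["NFD"])                            # decomposed: 'e' followed by U+0301
print(forms["NFC"] == forms["NFD"])            # byte-wise comparison of the two forms
print(same_name(forms["NFC"], forms["NFD"]))   # comparison after normalization
```

A tool that compares filenames byte for byte would treat these two forms as different names, which is one plausible way a client and a filesystem could disagree about whether a file exists. Normalizing both sides to a single form before comparing, as same_name does, is one possible mitigation, not necessarily what APR or Subversion do.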