From: Wilfredo Sánchez Vega
To: APR Developer List
In-Reply-To: <20070717122529.GB2587@redhat.com>
Subject: Re: apr_filepath_encoding on Darwin
Date: Tue, 17 Jul 2007 14:47:10 -0700
References: <20070717122529.GB2587@redhat.com>

On Jul 17, 2007, at 5:25 AM, Joe Orton wrote:

> On Tue, Jul 17, 2007 at 02:14:25PM +0200, Erik Huelsmann wrote:
>> Reading [1], I conclude that applications should pass UTF-8 to BSD
>> functions such as stat() at all times. This suggests to me that
>> apr_filepath_encoding() should return APR_FILEPATH_ENCODING_UTF8.
>>
>> Yet, looking at the sources, on any Unixy system, it returns
>> APR_FILEPATH_ENCODING_LOCALE.
>>
>> Is this an oversight, or am I missing something else?

My understanding is that in Darwin/Mac OS, all file names, when
accessed above the VFS layer, are, by convention, decomposed UTF-8.
This is confirmed by the Tech Note:

    http://developer.apple.com/qa/qa2001/qa1173.html

At the top:

    In Mac OS X's VFS API file names are, by definition,
    canonically decomposed Unicode, encoded using UTF-8.

Under "Returning Names", it is clear that the file system
implementation is expected to convert the on-disk file name encoding
(if known) to decomposed UTF-8:

    When returning names to higher layers (for example, from your
    VOP_READDIR entry point), you should always return decomposed
    names. If your underlying volume format uses precomposed names,
    you should convert any precomposed characters to their
    decomposed equivalents before returning them to the system.

Note that the above is considerably easier for a file system like
HFS+, where we know the on-disk encodings. It's trickier for any file
system which doesn't specify the file name encoding, which
unfortunately is most.
It's particularly tricky when the volume format is shared across
different operating systems, since other systems do not, AFAICT, have
well-established conventions for file name encoding (*).

Note also that the convention is not enforced per se (**). As a
result, you aren't guaranteed, even on Mac OS, that file names are
valid UTF-8 (***). That poses interesting problems. For example,
CFString (and therefore basically all Mac apps) has been known to
barf (and crash) when given a file name which isn't UTF-8, since it is
typically told that it is UTF-8.

    pathname = [NSFileHandle handleWithPath:
                   [NSString stringWithUTF8String: pathname]]; // boom!

This is rare in practice, since Mac apps don't produce "illegal"
file names for the same reason that they can't read them.

> This is deliberate; on Unix the character set used for filenames is
> dictated by the locale settings (e.g. LC_CTYPE), by convention.

Do you have a reference on this? I'm unaware of this convention.
Perhaps by "Unix", you mean "Linux" (***)? I was around when we
decided the above nonsense for Darwin, and I remember trying to find
some such reference, so I'd love to see it.

> There is certainly no Unix standard which dictates that all filenames
> must be UTF-8-encoded Unicode, so APR cannot enforce that.

No, but as I mention above, such a standard does exist in Darwin.

    -wsv

(*) I'm getting a vibe from Joe that Linux does, but I'm going to bet
that more software on Linux is unaware of the convention there than
Mac apps are on Mac OS (especially since most use our Toolkits, which
are).

(**) It is, sort of, on HFS+. But not really; if your file name
string is not UTF-8 but is a legal byte sequence in UTF-8, it'll be
stored as given. (Unless it looks like precomposed UTF-8 and gets
decomposed for you, which may look like corruption if that's not what
you expect, which is likely if you weren't thinking UTF-8.)

(***) The Single Unix Specification, which is probably the only
"authority" left regarding Unix standards, has little useful to say on
file name encodings. Here is everything I could find on the subject:

    For a filename to be portable across implementations conforming
    to IEEE Std 1003.1-2001, it shall consist only of the portable
    filename character set as defined in Portable Filename Character
    Set.

    The hyphen character shall not be used as the first character of
    a portable filename. Uppercase and lowercase letters shall retain
    their unique identities between conforming implementations. In
    the case of a portable pathname, the slash character may also be
    used.

So here we know that case-insensitive file systems are
non-conforming. Oops.

The Portable Filename Character Set is impressively weak:

    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    0 1 2 3 4 5 6 7 8 9 . _ -

This omits space and most punctuation, which makes sense if poorly
written shell scripts (an unfortunate majority) are in the portability
target.

File names are defined thusly:

    A name consisting of 1 to {NAME_MAX} bytes used to name a file.
    The characters composing the name may be selected from the set of
    all character values excluding the slash character and the null
    byte. The filenames dot and dot-dot have special meaning. A
    filename is sometimes referred to as a "pathname component".

Clearly this allows for byte sequences that are not legally UTF-8.

And we note that PATH is fairly ill-conceived:

    Filenames should be constructed from the portable filename
    character set because the use of other characters can be
    confusing or ambiguous in certain contexts. (For example, the use
    of a colon ( ':' ) in a pathname could cause ambiguity if that
    pathname were included in a PATH definition.)

This is all I could find on file names in the specification.

--
Wilfredo Sánchez - wsanchez@wsanchez.net
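
To illustrate the point that a file name is only a byte sequence and
need not be well-formed UTF-8, here is a minimal sketch in C of a
check a caller could run before handing a name to an API that assumes
UTF-8. The function name name_is_valid_utf8 is hypothetical; it is
not part of APR or of any Apple framework. It checks encoding only and
says nothing about whether the name is composed or decomposed.

```c
#include <stddef.h>

/* Hypothetical helper (not an APR function): returns 1 if the len bytes
 * at name are well-formed UTF-8, 0 otherwise. */
int name_is_valid_utf8(const char *name, size_t len)
{
    const unsigned char *s = (const unsigned char *)name;
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* continuation bytes expected after c */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point legal at this length */
        size_t k;

        if (c < 0x80) {                     /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {    /* 2-byte sequence */
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {    /* 3-byte sequence */
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {    /* 4-byte sequence */
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;   /* stray continuation byte or invalid lead byte */
        }

        if (len - i - 1 < need)
            return 0;   /* sequence truncated by end of name */

        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min)
            return 0;   /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;   /* UTF-16 surrogate, not a valid scalar value */
        if (cp > 0x10FFFF)
            return 0;   /* beyond the Unicode range */

        i += need + 1;
    }
    return 1;
}
```

With this sketch, both the precomposed bytes "caf\xC3\xA9" and the
decomposed bytes "cafe\xCC\x81" pass, while the Latin-1 bytes
"caf\xE9" are rejected, which is the kind of name the specification
permits but a UTF-8-only consumer cannot read.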