apr-dev mailing list archives

From Wilfredo Sánchez Vega <wsanc...@wsanchez.net>
Subject Re: apr_filepath_encoding on Darwin
Date Tue, 17 Jul 2007 21:47:10 GMT
On Jul 17, 2007, at 5:25 AM, Joe Orton wrote:

> On Tue, Jul 17, 2007 at 02:14:25PM +0200, Erik Huelsmann wrote:
>> Reading [1], I conclude that applications should pass UTF-8 to BSD
>> functions such as stat() at all times. This suggests to me that
>> apr_filepath_encoding() should return APR_FILEPATH_ENCODING_UTF8.
>>
>> Yet, looking at the sources, on any Unixy system, it returns
>> APR_FILEPATH_ENCODING_LOCALE.
>>
>> Is this an oversight, or am I missing something else?

   My understanding is that in Darwin/Mac OS, all file names, when  
accessed above the VFS layer, are, by convention, decomposed UTF-8.   
This is confirmed by the Tech Note:

     http://developer.apple.com/qa/qa2001/qa1173.html

   At the top:

	In Mac OS X's VFS API file names are, by definition,
	canonically decomposed Unicode, encoded using UTF-8.

   Under "Returning Names", it is clear that the file system  
implementation is expected to convert the on-disk file name encoding  
(if known) to decomposed UTF-8:

	When returning names to higher layers (for example,
	from your VOP_READDIR entry point), you should always
	return decomposed names. If your underlying volume
	format uses precomposed names, you should convert
	any precomposed characters to their decomposed
	equivalents before returning them to the system.

   Note that the above is considerably easier for a file system like  
HFS+, where we know the on-disk encodings.  It's trickier for any file  
system which doesn't specify the file name encoding, which  
unfortunately is most.  It's particularly tricky when the volume  
format is shared across different operating systems, since other  
systems do not, AFAICT, have well-established conventions for file  
name encoding (*).
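
   To make "precomposed" versus "decomposed" concrete, here is a small C sketch of the same visible character, "é", spelled both ways in UTF-8. The identifier names are illustrative only and come from neither APR nor the Tech Note.

```c
#include <string.h>

/* "é" as the single precomposed code point U+00E9, encoded in UTF-8. */
static const char precomposed_e_acute[] = "\xC3\xA9";

/* "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT, encoded in UTF-8.
 * This is the canonically decomposed spelling. */
static const char decomposed_e_acute[] = "e\xCC\x81";

/* Plain byte-wise comparison: returns 1 if the two C strings hold
 * identical bytes, 0 otherwise. No Unicode normalization is applied. */
static int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

   The two spellings render identically but differ in length (2 bytes versus 3), so same_bytes() reports them as different names. A file system that returns decomposed names therefore has to convert, not just copy, any precomposed name it finds on disk.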

   Note also that the convention is not enforced per se (**).  As a  
result, you aren't guaranteed, even on Mac OS, that file names are  
valid UTF-8 (***).  That poses interesting problems.  For example,  
CFString (and therefore basically all Mac apps) has been known to  
barf (and crash) when given a file name which isn't UTF-8, since it is  
typically told that the name is UTF-8.

	pathname = <some illegal UTF-8 string>
	[NSFileHandle handleWithPath:
	  [NSString stringWithUTF8String: pathname]]; // boom!

   This is rare in practice, since Mac apps don't produce "illegal"  
file names for the same reason that they can't read them.
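
   One defensive option for code that receives file names from the file system is to check that the bytes are well-formed UTF-8 before handing them to an API that assumes they are. The following is a minimal sketch of such a check in C; is_valid_utf8 is a hypothetical helper, not part of APR or of any Apple framework.

```c
#include <stddef.h>

/* Returns 1 if the len bytes at s are well-formed UTF-8, else 0.
 * Rejects overlong encodings, UTF-16 surrogates, and values above
 * U+10FFFF. Hypothetical helper for illustration only. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* continuation bytes expected after c */
        unsigned long cp;   /* code point being assembled */
        unsigned long min;  /* smallest value allowed at this length */
        size_t k;

        if (c < 0x80) { i++; continue; }                      /* ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;  /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < need)
            return 0;   /* sequence is cut off by the end of the input */

        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min) return 0;                      /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */
        if (cp > 0x10FFFF) return 0;                 /* out of range */

        i += need + 1;
    }
    return 1;
}
```

   A caller could fall back to treating the name as opaque bytes when this returns 0, rather than passing it on as UTF-8.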

> This is deliberate; on Unix the character set used for filenames is
> dictated by the locale settings (e.g. LC_CTYPE), by convention.

   Do you have a reference on this?  I'm unaware of this convention.   
Perhaps by "Unix", you mean "Linux" (***)?  I was around when we  
decided the above nonsense for Darwin, and I remember trying to find  
some such reference, so I'd love to see it.
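
   For context, a program that follows the locale convention Joe describes would typically ask the C library which character encoding the environment's locale implies. A minimal sketch using the POSIX calls setlocale() and nl_langinfo() follows; locale_codeset is a hypothetical wrapper name, and I am not claiming this is how APR does it.

```c
#include <langinfo.h>
#include <locale.h>

/* Returns the name of the character encoding implied by the
 * environment's locale settings (LC_ALL, LC_CTYPE, LANG), for example
 * "UTF-8". The returned string is owned by the C library.
 * Side effect: changes the process-wide LC_CTYPE locale. */
static const char *locale_codeset(void)
{
    /* A C program starts in the "C" locale; the empty string asks
     * setlocale() to adopt LC_CTYPE from the environment instead.
     * If that fails, the current locale is left unchanged. */
    setlocale(LC_CTYPE, "");

    return nl_langinfo(CODESET);
}
```

   The result describes the process's environment, not any particular volume, so two users with different LC_CTYPE settings can get different answers for the same file system.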

> There is certainly no Unix standard which dictates that all filenames
> must be UTF-8-encoded Unicode, so APR cannot enforce that.

   No, but as I mention above, such a standard does exist in Darwin.

	-wsv



(*) I'm getting a vibe from Joe that Linux does, but I'm going to bet  
that more software on Linux is unaware of the convention there than Mac  
apps are on Mac OS (especially since most use our Toolkits, which are).

(**) It is, sort of, on HFS+.  But not really; if your file name  
string is not UTF-8 but is a legal byte sequence in UTF-8, it'll be  
stored as given.  (Unless it looks like precomposed UTF-8 and gets  
decomposed for you, which may look like corruption if that's not what  
you expect, which is likely if you weren't thinking UTF-8.)

(***) The Single Unix Specification, which is probably the only  
"authority" left regarding Unix standards, has little useful to say on  
file name encodings.  Here is everything I could find on the subject:

	For a filename to be portable across implementations
	conforming to IEEE Std 1003.1-2001, it shall consist
	only of the portable filename character set as defined
	in Portable Filename Character Set.

	The hyphen character shall not be used as the first
	character of a portable filename. Uppercase and
	lowercase letters shall retain their unique identities
	between conforming implementations. In the case of a
	portable pathname, the slash character may also be used.

   So here we know that case-insensitive file systems are non- 
conforming.  Oops.

   The Portable Filename Character Set is impressively weak:

	A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
	a b c d e f g h i j k l m n o p q r s t u v w x y z
	0 1 2 3 4 5 6 7 8 9 . _ -

   This omits space and most punctuation, which makes sense if poorly  
written shell scripts (an unfortunate majority) are in the portability  
target.
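
   As an illustration, the two rules quoted above (the character set and the leading-hyphen restriction) fit in a few lines of C. is_portable_filename is a hypothetical helper name; rejecting the empty string is my own assumption, based on file names being at least one byte long.

```c
#include <string.h>

/* Returns 1 if name consists only of characters from the Portable
 * Filename Character Set and does not begin with a hyphen, else 0.
 * The empty string is rejected (an assumption, see the lead-in). */
static int is_portable_filename(const char *name)
{
    static const char allowed[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789._-";

    if (name[0] == '\0' || name[0] == '-')
        return 0;

    /* strspn() measures the leading run of bytes drawn from allowed;
     * the name is portable only if that run reaches the terminator. */
    return name[strspn(name, allowed)] == '\0';
}
```

   Under this check "README.txt" passes, while "my file", "-rf", and any name containing a non-ASCII byte all fail.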

   File names are defined thusly:

	A name consisting of 1 to {NAME_MAX} bytes used to name
	a file. The characters composing the name may be selected
	from the set of all character values excluding the slash
	character and the null byte. The filenames dot and dot-dot
	have special meaning. A filename is sometimes referred to
	as a "pathname component".

   Clearly this allows for byte sequences that are not legally UTF-8.
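
   A literal reading of that definition can be sketched in C as below. is_legal_filename is a hypothetical helper, and the name_max parameter stands in for {NAME_MAX}, which in practice varies per file system (pathconf() with _PC_NAME_MAX can report it).

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the len bytes at bytes satisfy the quoted definition of
 * a file name: between 1 and name_max bytes, none of which is a slash
 * or a null byte. No character encoding is checked, because the
 * definition does not mention one. */
static int is_legal_filename(const char *bytes, size_t len, size_t name_max)
{
    if (len < 1 || len > name_max)
        return 0;

    return memchr(bytes, '/', len) == NULL
        && memchr(bytes, '\0', len) == NULL;
}
```

   For example, the two bytes 0xFF 0xFE pass this check even though no UTF-8 string can contain either byte.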

   And we note that PATH is fairly ill-conceived:

	Filenames should be constructed from the portable filename
	character set because the use of other characters can be
	confusing or ambiguous in certain contexts. (For example,
	the use of a colon ( ':' ) in a pathname could cause
	ambiguity if that pathname were included in a PATH
	definition.)
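
   The ambiguity is easy to demonstrate: PATH has no escaping mechanism, so anything that splits on ':' cannot tell a separator from a colon inside a directory name. A minimal sketch, with count_path_entries as a hypothetical helper:

```c
#include <stddef.h>

/* Counts the entries in a PATH-style string, treating every ':' as a
 * separator. An empty string counts as one (empty) entry. */
static size_t count_path_entries(const char *path)
{
    size_t n = 1;

    for (; *path != '\0'; path++) {
        if (*path == ':')
            n++;
    }
    return n;
}
```

   Given "/usr/bin:/bin" this reports two entries, as intended. Given the single directory "/opt/a:b/bin" it also reports two, "/opt/a" and "b/bin", neither of which is the directory that was meant.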

   This is all I could find on file names in the specification.

—
Wilfredo Sánchez - wsanchez@wsanchez.net

