apr-dev mailing list archives

From: "Erik Huelsmann" <ehu...@gmail.com>
Subject: Re: apr_filepath_encoding on Darwin
Date: Tue, 17 Jul 2007 22:12:42 GMT

On 7/17/07, Wilfredo Sánchez Vega <wsanchez@wsanchez.net> wrote:
> On Jul 17, 2007, at 5:25 AM, Joe Orton wrote:
> > On Tue, Jul 17, 2007 at 02:14:25PM +0200, Erik Huelsmann wrote:
> >> Reading [1], I conclude that applications should pass UTF-8 to BSD
> >> functions such as stat() at all times. This suggests to me that
> >> apr_filepath_encoding() should return APR_FILEPATH_ENCODING_UTF8.
> >>
> >> Yet, looking at the sources, on any Unixy system, it returns
> >>
> >> Is this an oversight, or am I missing something else?
>    My understanding is that in Darwin/Mac OS, all file names, when
> accessed above the VFS layer, are, by convention, decomposed UTF-8.
> This is confirmed by the Tech Note:
>      http://developer.apple.com/qa/qa2001/qa1173.html
>    At the top:
>         In Mac OS X's VFS API file names are, by definition,
>         canonically decomposed Unicode, encoded using UTF-8.

Which suggests that apr_filepath_encoding should return
APR_FILEPATH_ENCODING_UTF8, if I'm not mistaken.
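To make "decomposed UTF-8" from the quoted Tech Note concrete, here is a minimal sketch in plain C. The names name_precomposed, name_decomposed and same_bytes are made up for illustration and are not part of APR. It only shows that the same visible file name has two different byte sequences, so a byte-wise comparison does not treat them as equal.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical example names, not taken from APR or from the Tech Note. */

/* "café" with U+00E9 (precomposed e-acute), encoded as UTF-8: 5 bytes. */
static const char name_precomposed[] = "caf\xC3\xA9";

/* "café" with "e" followed by U+0301 COMBINING ACUTE ACCENT,
 * encoded as UTF-8: 6 bytes. This is the canonically decomposed form. */
static const char name_decomposed[] = "cafe\xCC\x81";

/* Byte-wise comparison with no Unicode normalization applied.
 * Returns 1 if the two names are byte-identical, 0 otherwise. */
static int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Both arrays are valid UTF-8, so reporting the encoding as UTF-8 would be accurate for either; the decomposition question is a separate matter that an application comparing names would still have to handle, for instance by normalizing both sides first.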

>    Under "Returning Names", it is clear that the file system
> implementation is expected to convert the on-disk file name encoding
> (if known) to decomposed UTF-8:
>         When returning names to higher layers (for example,
>         from your VOP_READDIR entry point), you should always
>         return decomposed names. If your underlying volume
>         format uses precomposed names, you should convert
>         any precomposed characters to their decomposed
>         equivalents before returning them to the system.
>    Note that the above is considerably easier for a file system like
> HFS+, where we know the on-disk encodings.  It's trickier for any file
> system which doesn't specify the file name encoding, which
> unfortunately is most.  It's particularly tricky when the volume
> format is shared across different operating systems, since other
> systems do not, AFAICT, have well-established conventions for file
> name encoding (*).
>    Note also that the convention is not enforced per se (**).  As a
> result, you aren't guaranteed, even on Mac OS, that file names are
> valid UTF-8 (***).  That poses interesting problems.  For example,
> CFString (and therefore basically all Mac apps) have been known to
> barf (and crash) when given a file name which isn't UTF-8, since it is
> typically told that it is UTF-8.

But applications generally won't work with these non-UTF-8 paths if
they are well-behaved Mac OS X apps themselves, right? That reduces
the chances of being fed garbage. Other OSes can't guarantee UTF-8
either, though: LANG (and LC_CTYPE) are per-user settings, which can
differ between users, while path names are the same for all users.
So on Linux you can't be too sure either.
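Since neither platform guarantees that a path is valid UTF-8, a caller that is told "paths are UTF-8" may still want to check before handing a name to code that assumes it. The following is a rough sketch of such a check in plain C; is_valid_utf8 is a hypothetical helper, not an APR function, and it only tests well-formedness (it says nothing about composed versus decomposed form).

```c
#include <stddef.h>

/* Hypothetical helper, not part of APR.
 * Returns 1 if the NUL-terminated byte string is well-formed UTF-8,
 * 0 otherwise. Overlong encodings, UTF-16 surrogates and code points
 * above U+10FFFF are rejected. */
static int is_valid_utf8(const char *path)
{
    const unsigned char *s = (const unsigned char *)path;

    while (*s) {
        unsigned char c = s[0];
        size_t n;           /* number of continuation bytes expected */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80) {     /* plain ASCII byte */
            s++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            n = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            n = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            n = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;       /* stray continuation byte or invalid lead byte */
        }

        for (size_t i = 1; i <= n; i++) {
            /* A NUL terminator fails this test too, so a truncated
             * sequence never reads past the end of the string. */
            if ((s[i] & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        s += n + 1;
    }
    return 1;
}
```

A name written under an ISO-8859-1 locale, such as the bytes "caf\xE9", fails this check, which is the kind of path that can turn up on a volume shared between users or systems with different locale settings.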

Thanks for the extensive explanation.

