apr-dev mailing list archives

From "Roy T. Fielding" <field...@gbiv.com>
Subject Re: apr_filepath_encoding on Darwin
Date Thu, 19 Jul 2007 00:24:13 GMT
On Jul 18, 2007, at 9:55 AM, Wilfredo Sánchez Vega wrote:

> On Jul 18, 2007, at 2:11 AM, Joe Orton wrote:
>> - it is convention on all modern Unixes I'm aware of that filename
>> charset/encoding follows LC_CTYPE; not just Linux.  It may derive from
>> Solaris, I think that's where the locale APIs originate.
>   I guess I don't know how that works in practice.  When you have an
> encoded string, you need to know its encoding.  On a file system,
> there is (typically) no metadata to indicate the encoding of the
> file name string.
>   So I set my locale settings to correspond to encoding A and write a
> file.  Yours is encoding B.  On Linux, one expects the file name to
> display differently for the other user?

On Solaris, it is only documented within "man -s 5 environ", under
LC_CTYPE:


              This category  specifies  character  classification,
              character  conversion, and widths of multibyte char-
              acters. When LC_CTYPE is set to a valid  value,  the
              calling utility can display and handle text and file
              names containing valid characters for  that  locale;
              Extended  Unix Code (EUC) characters where any indi-
              vidual character can be 1, 2, or 3 bytes  wide;  and
              EUC  characters  of  1,  2,  or 3 column widths. The
              default "C" locale corresponds to  the  7-bit  ASCII
              character  set;  only characters from ISO 8859-1 are
              valid.  The  information   corresponding   to   this
              category  is  stored  in  a  database created by the
              localedef() command.  This environment  variable  is
              used  by  ctype(3C),  mblen(3C),  and many commands,
              such as cat(1), ed(1), ls(1), and vi(1).

POSIX does not recognize the use of LC_CTYPE for filenames because
the locale is supposed to be set on a per-process basis.

It doesn't work in practice.  The hack was added on Solaris in order
to give the appearance of internationalization without changing the
existing filesystems.  A better implementation would define it once
per mount point, with iso-8859-1 as the pre-existing default, and
allow that to be overridden by directory (where the names are stored).
I think that is why this use of locale was never standardized.
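
To make the failure concrete, here is a small sketch (Python, my own
illustration, not anything from APR or Solaris) of one set of on-disk
bytes viewed by two users whose LC_CTYPE implies different encodings.
The helper name view_as is hypothetical.

```python
def view_as(raw: bytes, encoding: str) -> str:
    """Decode the stored filename bytes the way a given locale would."""
    return raw.decode(encoding)


# User A creates the file under a utf-8 locale; these bytes go to disk.
on_disk = "Communiqué.txt".encode("utf-8")

# User A reads the name back under the same locale.
seen_by_a = view_as(on_disk, "utf-8")

# User B lists the same directory under an iso-8859-1 locale.
seen_by_b = view_as(on_disk, "iso-8859-1")

print(seen_by_a)
print(seen_by_b)
```

The bytes never change; only the interpretation does, and nothing in
the directory entry records which interpretation was intended.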

A system less concerned with backwards compatibility is better off
with a requirement of utf-8, though OS X should have made the filename
encoding a mount option.  I assume that the ISO9660-Joliet (CD-ROM)
driver does some form of filename translation automatically from UCS-2.
In any case, even with the convention, it is left to the application
to determine how it will treat encoded filenames.  The OS X decision
to treat them all as utf-8 is at least consistent.  OTOH, this
is just a display convention -- OS X apps should have been designed
to treat the filename internally as an opaque nul-terminated array,
rather than barfing on non-utf8 encodings.
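
A minimal sketch of the "opaque array" approach, in Python rather than
anything APR provides: the raw bytes remain the identity of the file,
and decoding happens only when a name has to be shown to a person.
The function name display_name is hypothetical.

```python
def display_name(raw: bytes) -> str:
    """Return a printable form of a filename without rejecting it.

    Bytes that are not valid utf-8 are shown as U+FFFD. The raw bytes
    are not modified; they are still what the caller should pass to
    open(), rename(), and so on.
    """
    return raw.decode("utf-8", errors="replace")


# An iso-8859-1 encoded name: a lone 0xE9 is not valid utf-8.
raw_name = b"Communiqu\xe9.txt"
shown = display_name(raw_name)

print(repr(raw_name))  # what is used for file operations
print(shown)           # what is used for display only
```

In Python, passing a bytes path to os.listdir() returns the names as
bytes, which is one way to keep them opaque from the start.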

One thing I miss in OS X is an automated way for file archivers
(like unzip) to recognize and convert non-utf-8 filenames
when they are unarchived.  I frequently have to do that by hand
after unzipping something from China or Switzerland. Subversion
breaks on OS X whenever someone commits a filename with an e-acute,
which is a problem when your main product name is Communiqué.
I wonder if this change in APR would fix that error?
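
For what it's worth, the by-hand conversion amounts to something like
the sketch below (Python; to_utf8_name and the iso-8859-1 default are
my assumptions, not part of any archiver or of APR). The legacy
encoding is a guess that has to be supplied per archive; names from
China would need something like gb18030 instead. The NFC step is also
an assumption on my part, added so that composed and decomposed forms
of the same accented name end up as identical bytes.

```python
import unicodedata


def to_utf8_name(raw: bytes, legacy: str = "iso-8859-1") -> bytes:
    """Convert an archived filename to utf-8, guessing at its encoding.

    Names that already decode as utf-8 are assumed to be utf-8;
    anything else is decoded with the caller's legacy guess.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode(legacy)
    # Compose base letter + combining accent into a single code point.
    return unicodedata.normalize("NFC", text).encode("utf-8")


converted = to_utf8_name(b"Communiqu\xe9")
print(converted)
```

Trying utf-8 first works reasonably well because legacy-encoded
accented text rarely happens to form valid utf-8 sequences, but it is
a heuristic and not a guarantee.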

