apr-dev mailing list archives

From "Roy T. Fielding" <field...@gbiv.com>
Subject Re: apr_filepath_encoding on Darwin
Date Tue, 07 Aug 2007 01:40:37 GMT
On Aug 6, 2007, at 5:39 PM, Wilfredo Sánchez Vega wrote:
> On Aug 6, 2007, at 5:11 PM, Roy T. Fielding wrote:
>> Actually, it also crashes on valid utf-8 in normal form, because OS X
>> doesn't follow the standard on normalization.  See "man -s 5 utf8":
>>     If more than a single representation of a value exists (for example,
>>     0x00; 0xC0 0x80; 0xE0 0x80 0x80) the shortest representation is always
>>     used.  Longer ones are detected as an error as they pose a potential
>>     security risk, and destroy the 1:1 character:octet sequence mapping.
>> but OS X requires the longer composition characters over shorter ones.
>> My guess is that choice was driven by the way the UI allows such
>> characters to be composed (like "alt-u u" for uumlaut).
>   Above the VFS layer, we always use decomposed UTF-8.

Er, yeah, did I say that backwards?  The man page says that equivalent
characters will use the shortest representation, which would mean
always using the composed form of UTF-8.  Right?  So the man page
for utf8 (from BSD) should be updated to explain the OS X quirks.
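
To make the byte counts concrete, here is a minimal Python sketch (not APR code; `is_strict_utf8` is a hypothetical helper made up for illustration). It assumes the man page's "longer representations" refers to overlong encodings such as 0xC0 0x80, which may be a separate question from composed versus decomposed normalization.

```python
import unicodedata

# Composed form (NFC): u-umlaut as the single code point U+00FC.
composed = unicodedata.normalize("NFC", "u\u0308")
# Decomposed form (NFD): 'u' followed by U+0308 COMBINING DIAERESIS.
decomposed = unicodedata.normalize("NFD", "\u00fc")

nfc_bytes = composed.encode("utf-8")    # 2 octets
nfd_bytes = decomposed.encode("utf-8")  # 3 octets


def is_strict_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if a strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(nfc_bytes.hex(" "), len(nfc_bytes))
print(nfd_bytes.hex(" "), len(nfd_bytes))
# Both normalization forms are well-formed UTF-8; only the overlong
# encodings of U+0000 quoted from the man page are rejected.
print(is_strict_utf8(nfc_bytes), is_strict_utf8(nfd_bytes))
print(is_strict_utf8(b"\xc0\x80"), is_strict_utf8(b"\xe0\x80\x80"))
```

Under that assumption, the decomposed form is longer in octets but still passes a strict decoder, so a filename in decomposed form should not by itself count as invalid UTF-8.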

I learned something new today -- use the -v option with ls to display
non-ASCII filenames.

>> What I do currently is define
>>   setenv  MM_CHARSET "utf-8"
>>   setenv  LANG       "en_US.utf-8"
>> in my shell init file.
>   On Mac OS (at least), that isn't relevant with respect to filenames,
> which is what the patch that Erik proposed fixes.

Yeah, but it is relevant on Solaris, which is why subversion attempts
to use it. *shrug*  I'll commit the patch if I ever get a chance to
compile it.

>   It is, however, relevant to how a CLI application encodes data sent
> to the terminal.  That is, the above means that Terminal.app expects
> to see UTF-8 English text.  (I think; again, I don't really know much
> about BSD locale settings.)

Terminal.app has its own Preferences that define the encoding used.
I don't think that can be overridden by the environment variables.

