apr-dev mailing list archives

From "William A. Rowe, Jr." <wr...@rowe-clan.net>
Subject Re: unicode file APIs (was: Re: canonical stuff)
Date Mon, 26 Feb 2001 02:21:26 GMT
From: "dean gaudet" <dgaudet-list-new-httpd@arctic.org>
Sent: Sunday, February 25, 2001 7:42 PM


> i'm a bit of an I18N novice, but doesn't it all just magically work if you
> use UTF-8 encoding everywhere?
>
> UTF-8 deliberately avoids using \0 and / in the encodings.  plain ascii
> works unmodified.  unix filesystems generally support UTF-8 directly
> (because of the \0 and / avoidance).
>
> this allows you to have a single API which understands unicode on all
> platforms -- you don't need to have _u versions which take unicode
> strings.
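
[Illustration: a minimal sketch of the property described above, not code from APR. In UTF-8, every byte of a multi-byte sequence is 0x80 or higher, so a byte-wise search for '/' (0x2F) can never match inside an encoded character. The function name count_separators is hypothetical.]

```c
#include <stddef.h>
#include <string.h>

/* Count '/' path separators by scanning raw bytes.
 * This is safe on UTF-8 input because lead and continuation bytes of
 * multi-byte sequences are all >= 0x80, so 0x2F only ever means '/'. */
size_t count_separators(const char *path)
{
    size_t n = 0;
    const char *p;

    for (p = strchr(path, '/'); p != NULL; p = strchr(p + 1, '/'))
        n++;
    return n;
}
```

The same reasoning covers the terminating \0, which is why ordinary C string functions keep working on UTF-8 path names.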

You've understood exactly what I proposed with APR_HAS_UNICODE_FS.
My only small change is a way to get config directives in with wchar
support.  Since Win32 has no UTF-8 editor, I'm working out a patch
to recognize the lead word of a Unicode stream and switch to
Unicode-to-UTF-8 conversion.  Even Notepad on Win32 supports Unicode
files, so this becomes a no-brainer for administrators.
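
[Illustration: a minimal sketch of recognizing the lead word, assuming it refers to the byte-order mark U+FEFF at the start of a UTF-16 file. This is not the actual patch; the enum and function names are hypothetical, and UTF-32 marks are deliberately not handled.]

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch; not part of APR. */
typedef enum {
    CFG_ENC_NARROW,    /* no mark found: treat as ASCII/UTF-8 bytes */
    CFG_ENC_UTF16_LE,  /* file starts with FF FE */
    CFG_ENC_UTF16_BE   /* file starts with FE FF */
} cfg_encoding;

/* Inspect the first two bytes of a config file buffer and report
 * whether it opens with a UTF-16 byte-order mark. */
cfg_encoding sniff_encoding(const unsigned char *buf, size_t len)
{
    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return CFG_ENC_UTF16_LE;
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return CFG_ENC_UTF16_BE;
    }
    return CFG_ENC_NARROW;
}
```

A reader built on this would skip the two mark bytes and convert the remaining 16-bit units to UTF-8 before handing lines to the directive parser.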

> give this page a perusal:  http://www.cl.cam.ac.uk/~mgk25/unicode.html

I especially liked a comment from http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux

  External file system drivers such as VFAT and WinNT have to convert
  file name character encodings. UTF-8 has to be added to the list of
  already available conversion options, and the mount command has to
  tell the kernel driver that user processes shall see UTF-8 file
  names. Since VFAT and WinNT use already Unicode anyway, UTF-8 has
  the advantage of guaranteeing a lossless conversion here.

My key concept is _lossless_.  All SomeWin32FunctionA() variants are lossy, and
their encoding doesn't correspond to MS's own clib [we can comment on their lack
of brain cells here ... but we won't.]  All SomeWin32FunctionW() variants are
not only lossless, but faster.  Obviously we replace their conversion cycles
from local code page to unicode with our own utf-8 to unicode functions, but that
shouldn't (if I succeeded) add any net CPU cycles.
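
[Illustration: a minimal sketch of a UTF-8 to UTF-16 conversion of the kind described above, whose output could then be passed to a SomeWin32FunctionW() call. It is not APR's converter; the name utf8_to_utf16 and its error convention are assumptions made for this example.]

```c
#include <stddef.h>
#include <stdint.h>

/* Convert inlen bytes of UTF-8 into UTF-16 code units.
 * Returns the number of units written, or (size_t)-1 if the input is
 * malformed or the output buffer is too small. */
size_t utf8_to_utf16(const unsigned char *in, size_t inlen,
                     uint16_t *out, size_t outcap)
{
    size_t i = 0, o = 0, k;

    while (i < inlen) {
        unsigned char b = in[i++];
        uint32_t cp;
        size_t extra;

        if (b < 0x80)                    { cp = b;        extra = 0; }
        else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0)     { cp = b & 0x0F; extra = 2; }
        else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;  /* stray continuation or invalid lead */

        if (inlen - i < extra)
            return (size_t)-1;   /* truncated sequence */
        for (k = 0; k < extra; k++) {
            unsigned char c = in[i++];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Reject overlong forms, surrogates, and out-of-range values. */
        if ((extra == 2 && cp < 0x800) || (extra == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return (size_t)-1;

        if (cp < 0x10000) {
            if (o + 1 > outcap)
                return (size_t)-1;
            out[o++] = (uint16_t)cp;
        } else {
            if (o + 2 > outcap)
                return (size_t)-1;
            cp -= 0x10000;       /* encode as a surrogate pair */
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return o;
}
```

Rejecting overlong forms matters for file names: the two-byte sequence C0 AF would otherwise decode to '/' and slip past byte-wise path checks.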

Of course they don't correspond to the clib functions [e.g., consider strlen(),
which counts bytes rather than characters], but we are damned if we do, damned
if we don't.  mod_autoindex obviously needs to see APR_HAS_UNICODE_FS and adjust
the width accordingly.  We will get there, but we aren't there yet.
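
[Illustration: a minimal sketch of the strlen() mismatch, assuming well-formed UTF-8 input. The function name utf8_codepoints is hypothetical and this is not mod_autoindex code.]

```c
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated UTF-8 string by skipping
 * continuation bytes (those of the form 10xxxxxx).
 * Assumes the input is well-formed UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the name "caf\xC3\xA9", strlen() reports 5 bytes while utf8_codepoints() reports 4. Even the code point count is only an approximation of display width, since combining marks and double-width characters break the one-column-per-character assumption.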

If we support the native narrow characters, we need an effective API to do so
[should we use the current ANSI code page or the current OEM code page?]  We didn't
have a respectable design, and this change made all those other issues moot.


