apr-dev mailing list archives

From "William A. Rowe, Jr." <wr...@rowe-clan.net>
Subject Re: canonical stuff (was: Re: apache 2.0.11 - tag 2.0.12?)
Date Sat, 24 Feb 2001 17:31:49 GMT
From: "Greg Stein" <gstein@lyra.org>
Sent: Saturday, February 24, 2001 3:44 AM

> On Fri, Feb 23, 2001 at 02:21:22PM -0600, William A. Rowe, Jr. wrote:
> >...
> > I have some very major structural hacking to do to wipe out the old canonical
> > methods - and a quiet house to do so for the next two days.  I don't want to
> > start warping the source as must be done till we have this 'good' tag so other
> > folks can start looking for any remaining leaks and holes.
> Can we do the canonical stuff in pieces rather than wholesale? IOW, add the
> new functions into CVS and review. After that is stable, then start the
> conversion process. (specifically, there was a lot of concerns all around
> about how this stuff would be built/operate, so it seems prudent to do that
> outline via actual code, agree on it, then to use it)

Yes yes yes!  Very shortly... let's please get 2_0_12 ready (I see you did :-)

> In a similar vein, when you added all that Unicode stuff, it just kind of
> dropped into the code. No big deal as it was all Win32 specific (i.e. it
> didn't affect my playground), but it was an awfully big change. Especially
> in the semantics. We still haven't refactored the API into two sets of
> functions (one for Unicode chars, one for 8-bit native).

I'm absolutely positively near certain we won't.  Please let me explain.

The underlying 'real' filesystem on WinNT [not on 9x] is Unicode.  There is a
huge body of folks that don't restrict their playgrounds to ASCII - much of
their keyboard isn't.  I wrote a JavaScript playground about two years ago
to experiment with client-side reporting [allowing the client to handle the
cruft of re-sorting reports].  Problem?  http://mall.lnd.com/wrowe/pokémon/
wasn't working for me, although it worked on my local file system.  Gave up,
of course, at that time.

What does this have to do with anything?  Win32 is completely bogus in terms
of context.  This patch was required: as soon as the per-user-vhost stuff is
added to the MPM, we begin to see the 128-255 character values start to shift
based on the --user's-- desired codepage and location.  This is entirely bogus
for a server, although it is pretty cool for a shared interactive workstation.

Apache - in terms of canonical, absolute values for filenames - doesn't care to
see shifting cruft like that.  So we end up in a very odd situation.  Either we
dismiss this all and set up a bunch of 'use this codepage' controls to assure
we aren't shifting around - or we simply use the full and unrestricted charset.

Users have asked for Unicode filenames.  Very few members of this list would
care to see the Apache engine provide wchar_t support for all our strings.  That
would be a monster.  Since everything we do is down the 8 bit wire, it makes
next to 0 sense to even attempt it.

How does Unix provide support?  UTF-8, typically.  Yes - you can set up your
local code page and even support multibyte encodings like JIS - but why?  If
you are serving international web sites - UTF-8 is the way to go for naming
resources across many languages.

But most importantly, once we convert to Unicode, we break the 255 character
limit on file pathnames.  We bypass the Windows internal name conversion
that always occurs from the user's current codepage into Unicode.  This needs
benchmarking and comparison, optimization of my quick (?) UTF-8 converter, and
possibly other refinements, but it is the way to go.
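
The converter mentioned above is not shown in this message.  As a rough
illustration of the mapping such a converter has to perform, here is a minimal
sketch that encodes a single Unicode code point as UTF-8.  The function name
utf8_encode is hypothetical and is not part of APR; a real converter would
also need to decode UTF-16 surrogate pairs and handle the reverse direction.

```c
#include <stddef.h>

/* Hypothetical helper, not an APR function.
 * Encodes one Unicode code point as UTF-8 into out[], which must have room
 * for 4 bytes.  Returns the number of bytes written (1 to 4), or 0 if the
 * code point is a surrogate (U+D800..U+DFFF) or lies above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 7 bits: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {    /* lone surrogates are invalid */
        return 0;
    }
    if (cp < 0x10000) {                    /* 16 bits: three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 21 bits: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

For example, the "é" in the URL above is U+00E9, which this sketch encodes as
the two bytes C3 A9.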

We aren't done - getting Unicode environment variables into Perl or other
Unicode-enabled parsers still needs to be done.  But I don't see a clean way to provide
both high-bit latin characters and Unicode at the same time.

If we layer our encoding onto the filename functions, and eventually allow the
user to specify a file name in any encoding, that's cool.  That could be its own
API, or maybe simply an apr_filesystem_encoding_set/apr_filesystem_encoding_get
[I don't like this solution in a multi-threaded/multiple libraries linking against
a common apr library scenario].
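
To illustrate the concern with a process-wide setting, here is a minimal
sketch of a set/get pair backed by a single global.  The names
demo_fs_encoding_set and demo_fs_encoding_get are hypothetical stand-ins and
do not describe any existing APR interface; the point is only that every
caller sharing the library shares the one value.

```c
#include <stddef.h>

/* Hypothetical process-wide setting, for illustration only.
 * Every module linked against the same library sees this one variable. */
static const char *demo_fs_encoding = "utf-8";

/* Replace the process-wide encoding name.  The caller keeps ownership of
 * the string, which must outlive its use here. */
void demo_fs_encoding_set(const char *name)
{
    if (name != NULL)
        demo_fs_encoding = name;
}

/* Return whatever encoding name was set last, by anyone. */
const char *demo_fs_encoding_get(void)
{
    return demo_fs_encoding;
}
```

If one library calls demo_fs_encoding_set("iso-8859-1") and another later
calls demo_fs_encoding_set("utf-8"), the first library's next
demo_fs_encoding_get() returns "utf-8".  With threads, the same overwrite can
happen between any two filename calls.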

All we did was transform ambiguous naming into absolute naming as the underlying
API --- where we go from here is the apr community's choice, but IMHO Apache
doesn't need the second API.  All Apache needs now is the code to detect FFFE or
FEFF in the first two bytes of any config file, to decide it's a file saved as
Unicode and convert to UTF-8 on the fly.
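
A minimal sketch of that detection step, assuming the first bytes of the
config file have already been read into a buffer.  The enum and function
names are hypothetical, not existing Apache or APR code, and the conversion
to UTF-8 itself is left out.

```c
#include <stddef.h>

/* Hypothetical result type for the byte-order-mark check. */
typedef enum {
    BOM_NONE,      /* no UTF-16 mark found: treat the file as 8-bit text */
    BOM_UTF16_LE,  /* bytes FF FE: little-endian UTF-16 */
    BOM_UTF16_BE   /* bytes FE FF: big-endian UTF-16 */
} bom_kind;

/* Inspect the first two bytes of buf (len bytes available) and report
 * whether they form a UTF-16 byte order mark. */
bom_kind detect_utf16_bom(const unsigned char *buf, size_t len)
{
    if (buf == NULL || len < 2)
        return BOM_NONE;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

A config reader could call this once on the first buffer it reads and, on
anything other than BOM_NONE, skip the two mark bytes and route the rest of
the file through a UTF-16 to UTF-8 conversion.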

