httpd-dev mailing list archives

From "William A. Rowe, Jr." <wr...@lnd.com>
Subject RE: BUFF, IOL, Chunking, and Unicode in 2.0 (long)
Date Tue, 02 May 2000 15:17:10 GMT
A few have argued recently that there is little point in releasing a
2.0 revision, since it's a glorified 1.3.x without the truly application
independent APR.  I have to agree that APR, to date, is rather cobbled
together pieces of Apache.  Movement underway makes me believe this will
change over the course of the 2.0 release.

On the server side, for admins, we really don't have what would appear
to be a 'later and greater' set of features.  This isn't true for NT
and some other previously neglected platforms, of course. But Unix
admins may already be wondering "what's the fuss?".

I am 100% behind the structure changes of 2.0 - but they are developer
features and optimizations, certain to (sooner or later) please the
module developers.  I assert that true Unicode capability will stem
many potential arguments in favor of other servers, and make Apache the
server of choice for the next 5 years.

Adding to Martin's thoughts, it would be trivial, in this example, to
implement a mod_uhtml module for serving raw unicode html files.  Perl
is there.  Why aren't we?

If I have time to contribute later in the month, his proposal is one
I would offer 100% support to (in some code, as well).  The funny thing
about his proposal is that it could be 0 copy on Unicode platforms (such
as Windows NT), if the native sendfile semantics offer unicode support!

This also makes IPv6 much more trivial, IMHO.  However, it may not be
the solution to the EBCDIC issue, for performance reasons alone.  The
big question you left us with is the optimization of Unicode
translation: we can't (don't want to) keep 4MB of 64KB tables in memory
(for the out-of-unicode scenario), so I'm willing to work up an
optimized set of APR routines for this.  Is everyone else ready to make
this plunge?

Bill


> -----Original Message-----
> From: Martin Kraemer [mailto:Martin.Kraemer@Mch.SNI.De]
> Sent: Tuesday, May 02, 2000 8:52 AM
> To: new-httpd@apache.org
> Cc: trawickj@bellsouth.net
> Subject: BUFF, IOL, Chunking, and Unicode in 2.0 (long)
> 
> 
> 
> Sorry for a long silence in the past weeks, I've been busy with other
> stuff.
> 
> Putting the catch-words "Chunking, Unicode and 2.0" into the subject
> was on purpose: I didn't want to scare off anyone because of the word
> EBCDIC: the problems I describe here, and the proposed new buff.c
> layering, are mostly independent from the EBCDIC port.
> 
> 
> In the past weeks, I've been thinking about today's buff.c (and
> studied its applicability for automatic conversion stuff like in the
> russian apache, see apache.lexa.ru). I think it would be neat to be
> able to do automatic character set conversion in the server, for
> example by negotiation (when the client sends an Accept-Charset and
> the server doesn't have a document with exactly the right Charset, but
> knows how to generate it from an existing representation).
> 
> IMO it is a reoccurring problem,
> 
> * not only in today's russian internet environment (de facto browsers
>   support 5 different cyrillic character sets, but the server doesn't
>   want to hold every document in 5 copies, so an automatic translation
>   is performed by the russian apache, depending on information
>   supplied by the client, or by explicit configuration). One of the
>   supported character sets is Unicode (UTF-7 or UTF-8)
> 
> * in japanese/chinese environments, support for 16 bit character sets
>   is an absolute requirement. (Other oriental scripts like Thai get
>   along with 8 bit: they only have 44 consonants and 16 vowels).
>   Having success on the eastern markets depends to a great deal on
>   having support for these character sets. The japanese Apache
>   community hasn't had much contact with new-httpd in the past, but
>   I'm absolutely sure that there is a "standard japanese patch" for
>   Apache which would well be worth integrating into the standard
>   distribution. (Anyone on the list to provide a pointer?)
> 
> * In the future, more and more browsers will support unicode, and so
>   will the demand grow for servers supporting unicode. Why not
>   integrate ONE solution for the MANY problems worldwide?
> 
> * The EBCDIC port of 1997 has been a simple solution for a rather
>   simple problem. If we would "do it right" for 2.0 and provide a
>   generic translation layer, we would solve many problems in a single
>   blow. The EBCDIC translation would be only one of them.
> 
> Jeff has been digging through the EBCDIC stuff and apparently
> succeeded in porting a lot of the 1.3 stuff to 2.0 already. Jeff, I'd
> sure be interested in having a look at it. However, when I looked at
> buff.c and the new iol_* functionality, I found out that iol's are not
> the way to go: they give us no solution for any of the conversion
> problems:
> 
> * iol's sit below BUFF. Therefore, they don't have enough information
>   to know which part of the written byte stream is net client data,
>   and which part is protocol information (chunks, MIME headers for
>   multipart/*).
> 
> * iol's don't allow simplification of today's chunking code. It is
>   spread throughout buff.c and there's a very hairy balance between
>   efficiency and code correctness. Re-adding (EBCDIC/UTF) conversion,
>   possibly with support for multi byte character sets (MBCS), would
>   make a code nightmare out of it. (buff.c in 1.3 was "almost" a
>   nightmare because we had only single byte translations.)
> 
> * Putting conversion to a hierarchy level any higher than buff.c is no
>   solution either: for chunks, as well as for multipart headers and
>   buffering boundaries, we need character set translation. Pulling it
>   to a higher level means that a lot of redundant information has to
>   be passed down and up.
> 
> In my understanding, we need a layered buff.c (which I number from 0
> upwards):
> 
> 0) at the lowest layer, there's a "block mode" which basically
>    supports bread/bwrite/bwritev by calling the equivalent iol_*
>    routines. It doesn't know about chunking, conversion, buffering and
>    the like. All it does is read/write with error handling.
> 
> 1) the next layer handles chunking. It knows about the current
>    chunking state and adds chunking information into the written
>    byte stream at appropriate places. It does not need to know about
>    buffering, or what the current (ebcdic?) conversion setting is.
> 
> 2) this layer handles conversion. I was thinking about a concept
>    where a generic character set conversion would be possible based on
>    Unicode-to-any translation tables. This would also deal with
>    multibyte character sets, because at this layer, it would
>    be easy to convert SBCS to MBCS.
>    Note that conversion *MUST* be positioned above the chunking layer
>    and below the buffering layer. The former guarantees that chunking
>    information is not converted twice (or not at all), and the latter
>    guarantees that ap_bgets() is looking at the converted data
>    (-- otherwise it would fail to find the '\n' which indicates end-
>    of-line).
>    Using (loadable?) translation tables based on unicode definitions
>    is a very similar approach to what libiconv offers you (see
>    http://clisp.cons.org/~haible/packages-libiconv.html -- though my
>    inspiration came from the russian apache, and I only heard about
>    libiconv recently). Every character set can be defined as a list
>    of <hex code> <unicode equiv> pairs, and translations between
>    several SBCS's can be collapsed into a single 256 char table.
>    Efficiently building them once only, and finding them fast is an
>    optimization task.
> 
> 3) This last layer adds buffering to the byte stream of the lower
>    layers. Because chunking and translation have already been dealt
>    with, it only needs to implement efficient buffering. Code
>    complexity is reduced to simple stdio-like buffering.
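
The table collapsing described for layer 2 can be sketched as follows. This is a minimal illustration, not code from the prototype: the names sbcs_charset, sbcs_collapse, sbcs_translate and the '?' substitute byte are all hypothetical, and it assumes each single byte character set is given as a 256-entry byte-to-Unicode table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SBCS_SIZE  256
#define SUBSTITUTE '?'   /* hypothetical stand-in for unmappable bytes */

/* One single byte character set: to_unicode[b] is the Unicode code
 * point that byte b represents (the <hex code> <unicode equiv> pairs). */
typedef struct {
    uint16_t to_unicode[SBCS_SIZE];
} sbcs_charset;

/* Collapse src -> Unicode -> dst into one direct 256 byte table.
 * Built once per charset pair; bytes with no equivalent in dst map
 * to SUBSTITUTE. */
void sbcs_collapse(const sbcs_charset *src, const sbcs_charset *dst,
                   unsigned char table[SBCS_SIZE])
{
    assert(src != NULL && dst != NULL && table != NULL);
    for (int s = 0; s < SBCS_SIZE; s++) {
        table[s] = SUBSTITUTE;
        for (int d = 0; d < SBCS_SIZE; d++) {
            if (dst->to_unicode[d] == src->to_unicode[s]) {
                table[s] = (unsigned char)d;
                break;
            }
        }
    }
}

/* Translate a buffer in place: one table lookup per byte. */
void sbcs_translate(const unsigned char table[SBCS_SIZE],
                    unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = table[buf[i]];
}
```

The quadratic build cost is paid once per pair of character sets; caching the collapsed tables and finding them quickly is the optimization task mentioned above and is not addressed in this sketch.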
> 
> 
> Creating a BUFF stream involves creation of the basic (layer 0) BUFF,
> and then pushing zero or more filters (in the right order) on top of
> it. Usually, this will add the chunking layer, optionally add
> the conversion layer, and usually add the buffering layer (look for
> ap_bcreate() in the code: it almost always uses B_RD/B_WR).
> 
> Here's code from a conceptual prototype I wrote:
>     BUFF *buf = ap_bcreate(NULL, B_RDWR), *chunked, *buffered;
>     chunked   = ap_bpush_filter(buf,     chunked_filter, 0);
>     buffered  = ap_bpush_filter(chunked, buffered_filter, B_RDWR);
>     ap_bputs("Data for buffered ap_bputs\n", buffered);
> 
> 
> Using a BUFF stream doesn't change: simply invoke the well known API
> and call ap_bputs() or ap_bwrite() as you would today. Only, these
> would be wrapper macros
> 
>     #define ap_bputs(data, buf)              (buf)->bf_puts(data, buf)
>     #define ap_bwrite(buf, data, max, lenp)  (buf)->bf_write(buf, data, max, lenp)
> 
> where a BUFF struct would hold function pointers and flags for the
> various levels' input/output functions, in addition to today's BUFF
> layout.
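
To make the function pointer idea concrete, here is a small self-contained sketch of two stacked layers. It is an assumption-laden illustration rather than the real BUFF layout: sketch_buff, sink_write, chunk_write, sketch_init_sink, sketch_push and sketch_bwrite are hypothetical names, layer 0 appends to a memory buffer instead of calling iol_* routines, and the chunking layer frames every write as one HTTP/1.1 chunk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct sketch_buff sketch_buff;
typedef int (*sketch_write_fn)(sketch_buff *self, const char *data, size_t len);

struct sketch_buff {
    sketch_write_fn bf_write;  /* this layer's write function          */
    sketch_buff    *lower;     /* next layer down, NULL for layer 0    */
    char            sink[256]; /* layer 0 only: captured output        */
    size_t          used;      /* layer 0 only: bytes stored in sink   */
};

/* Layer 0: "block mode". Appends to memory here; no chunking, no
 * conversion, no buffering, only error handling. */
int sink_write(sketch_buff *self, const char *data, size_t len)
{
    if (len > sizeof(self->sink) - self->used)
        return -1;
    memcpy(self->sink + self->used, data, len);
    self->used += len;
    return (int)len;
}

/* Layer 1: emits "<hex length>CRLF data CRLF" through the lower layer. */
int chunk_write(sketch_buff *self, const char *data, size_t len)
{
    char head[32];
    int n;

    if (len == 0)
        return 0;   /* a zero-length chunk would terminate the body */
    n = snprintf(head, sizeof head, "%lx\r\n", (unsigned long)len);
    if (self->lower->bf_write(self->lower, head, (size_t)n) < 0 ||
        self->lower->bf_write(self->lower, data, len) < 0 ||
        self->lower->bf_write(self->lower, "\r\n", 2) < 0)
        return -1;
    return (int)len;
}

void sketch_init_sink(sketch_buff *b)
{
    b->bf_write = sink_write;
    b->lower = NULL;
    b->used = 0;
}

/* Stack a filter on top of an existing layer. */
void sketch_push(sketch_buff *upper, sketch_buff *lower, sketch_write_fn fn)
{
    assert(lower != NULL);
    upper->bf_write = fn;
    upper->lower = lower;
    upper->used = 0;
}

/* Callers keep one entry point regardless of how many layers exist. */
#define sketch_bwrite(buf, data, len) ((buf)->bf_write((buf), (data), (len)))
```

After sketch_init_sink(&base) and sketch_push(&chunked, &base, chunk_write), a call to sketch_bwrite(&chunked, "hello", 5) leaves "5\r\nhello\r\n" in base.sink. The caller never needs to know whether a chunking layer is present.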
> 
> For performance improvement, the following can be added to taste:
> 
> * less buffering (zero copy where possible) by putting the buffers
>   for buffered reading/writing down as far as possible (for SBCS: from
>   layer 3 to layer 0). By doing this, the buffer can also hold a
>   chunking prefix (used by layer 1) in front of the buffering buffer
>   to reduce the number of vectors in a writev, or the number of copies
>   between buffers. Each layer could indicate whether it needs a
>   private buffer or not.
> 
> * intra-module calls can be hardcoded to call the appropriate lower
>   layer directly, instead of using the ap_bwrite() etc macros. That
>   means we don't use the function pointers all the time, but instead
>   call the lower levels directly. OTOH we have iol_* stuff which uses
>   function pointers anyway. We decided in 1.3 that we wanted to avoid
>   the C++ type stuff (esp. function pointers) for performance reasons.
>   But it would sure reduce the code complexity a lot.
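
The first point, holding the chunking prefix in front of the buffering buffer, might look roughly like the sketch below. It is only one possible reading of the idea: chunk_buf, chunk_append, chunk_frame and the HEADROOM/TAILROOM/PAYLOAD_MAX sizes are hypothetical, and the sketch assumes one chunk per buffer flush.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define HEADROOM    10    /* room for up to 8 hex digits plus CRLF */
#define TAILROOM    2     /* trailing CRLF                         */
#define PAYLOAD_MAX 4096

typedef struct {
    char   store[HEADROOM + PAYLOAD_MAX + TAILROOM];
    size_t len;           /* payload bytes currently buffered */
} chunk_buf;

/* Buffer payload bytes after the reserved headroom; returns how many
 * bytes were accepted (fewer than len if the buffer is full). */
size_t chunk_append(chunk_buf *b, const char *data, size_t len)
{
    size_t room = PAYLOAD_MAX - b->len;

    if (len > room)
        len = room;
    memcpy(b->store + HEADROOM + b->len, data, len);
    b->len += len;
    return len;
}

/* Write the chunk header right-aligned into the headroom and the CRLF
 * trailer after the payload, so the whole chunk is one contiguous
 * region. Returns its start and stores its size in *total. */
const char *chunk_frame(chunk_buf *b, size_t *total)
{
    char head[HEADROOM + 1];
    char *start;
    int n;

    assert(b->len <= PAYLOAD_MAX);
    n = snprintf(head, sizeof head, "%lx\r\n", (unsigned long)b->len);
    start = b->store + HEADROOM - n;
    memcpy(start, head, (size_t)n);
    memcpy(b->store + HEADROOM + b->len, "\r\n", 2);
    *total = (size_t)n + b->len + 2;
    return start;
}
```

With this layout the header, payload and trailer can go out in a single write call, where three separate buffers would need three writev vectors or an extra copy.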
> 
> The resulting layering would look like this:
> 
>     | Caller: using ap_bputs() | or ap_bgets/ap_bwrite etc.
>     +--------------------------+
>     | Layer 3: Buffered I/O    | gets/puts/getchar functionality
>     +--------------------------+
>     | Layer 2: Code Conversion | (optional conversions)
>     +--------------------------+
>     | Layer 1: Chunking Layer  | Adding chunks on writes
>     +--------------------------+
>     | Layer 0: Binary Output   | bwrite/bwritev, error handling
>     +--------------------------+
>     | iol_* functionality      | basic i/o
>     +--------------------------+
>     | apr_* functionality      |
>     ....
> 
> -- 
> <Martin.Kraemer@MchP.Siemens.De>             |    Fujitsu Siemens
> Fon: +49-89-636-46021, FAX: +49-89-636-41143 | 81730  Munich,  Germany
> 
