httpd-dev mailing list archives

From Martin Kraemer <Martin.Krae...@Mch.SNI.De>
Subject BUFF, IOL, Chunking, and Unicode in 2.0 (long)
Date Tue, 02 May 2000 13:51:30 GMT

Sorry for a long silence in the past weeks, I've been busy with other
stuff.

Putting the catch-words "Chunking, Unicode and 2.0" into the subject
was deliberate: I didn't want to scare anyone off with the word
EBCDIC. The problems I describe here, and the proposed new buff.c
layering, are mostly independent of the EBCDIC port.


In the past weeks, I've been thinking about today's buff.c (and
studied its applicability for automatic conversion, as done in the
Russian Apache, see apache.lexa.ru). I think it would be neat to be
able to do automatic character set conversion in the server, for
example by negotiation (when the client sends an Accept-Charset and
the server doesn't have a document in exactly the right charset, but
knows how to generate it from an existing representation).
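
To make that concrete, a hypothetical exchange might look like the
following (the header values are only illustrative, not taken from any
existing implementation): the client announces the charsets it can
display, and the server converts a stored representation on the fly.

    GET /doc.html HTTP/1.1
    Host: www.example.ru
    Accept-Charset: utf-8, koi8-r;q=0.8

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=utf-8

    ... body converted from the stored koi8-r copy to UTF-8 ...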

IMO it is a recurring problem,

* not only in today's Russian Internet environment (de facto, browsers
  support 5 different Cyrillic character sets, but the server doesn't
  want to hold every document in 5 copies, so an automatic translation
  is performed by the Russian Apache, depending on information supplied
  by the client or on explicit configuration). One of the supported
  character sets is Unicode (UTF-7 or UTF-8).

* in Japanese/Chinese environments, support for 16-bit character sets
  is an absolute requirement. (Other Asian scripts like Thai get
  along with 8 bits: Thai has only 44 consonants and 16 vowels.)
  Success in the Eastern markets depends to a great degree on
  support for these character sets. The Japanese Apache
  community hasn't had much contact with new-httpd in the past, but
  I'm absolutely sure that there is a "standard Japanese patch" for
  Apache which would be well worth integrating into the standard
  distribution. (Anyone on the list to provide a pointer?)

* In the future, more and more browsers will support Unicode, and the
  demand for servers supporting Unicode will grow with them. Why not
  integrate ONE solution for the MANY problems worldwide?

* The EBCDIC port of 1997 was a simple solution to a rather simple
  problem. If we "did it right" for 2.0 and provided a generic
  translation layer, we would solve many problems at a single blow.
  The EBCDIC translation would be only one of them.

Jeff has been digging through the EBCDIC stuff and apparently
succeeded in porting a lot of the 1.3 stuff to 2.0 already. Jeff, I'd
sure be interested in having a look at it. However, when I looked at
buff.c and the new iol_* functionality, I found out that iol's are not
the way to go: they give us no solution for any of the conversion
problems:

* iol's sit below BUFF. Therefore, they don't have enough information
  to know which part of the written byte stream is actual client data
  and which part is protocol information (chunks, MIME headers for
  multipart/*).

* iol's don't allow simplification of today's chunking code. It is
  spread throughout buff.c, and there's a very hairy balance between
  efficiency and code correctness. Re-adding (EBCDIC/UTF) conversion,
  possibly with support for multi-byte character sets (MBCS), would
  make a code nightmare out of it. (buff.c in 1.3 was "almost" a
  nightmare, and there we had only single-byte translations.)

* Putting conversion at a hierarchy level any higher than buff.c is no
  solution either: for chunks, as well as for multipart headers and
  buffering boundaries, we need character set translation. Pulling it
  up to a higher level means that a lot of redundant information has
  to be passed down and up.

In my understanding, we need a layered buff.c (which I number from 0
upwards):

0) at the lowest layer, there's a "block mode" which basically
   supports bread/bwrite/bwritev by calling the equivalent iol_*
   routines. It doesn't know about chunking, conversion, buffering and
   the like. All it does is read/write with error handling.

1) the next layer handles chunking. It knows about the current
   chunking state and adds chunking information into the written
   byte stream at appropriate places. It does not need to know about
   buffering, or what the current (ebcdic?) conversion setting is.

2) this layer handles conversion. I was thinking about a concept
   where a generic character set conversion would be possible based on
   Unicode-to-any translation tables. This would also deal with
   multibyte character sets, because at this layer, it would
   be easy to convert SBCS to MBCS.
   Note that conversion *MUST* be positioned above the chunking layer
   and below the buffering layer. The former guarantees that chunking
   information is not converted twice (or not at all), and the latter
   guarantees that ap_bgets() is looking at the converted data
   (-- otherwise it would fail to find the '\n' which indicates end-
   of-line).
   Using (loadable?) translation tables based on Unicode definitions
   is a very similar approach to what libiconv offers (see
   http://clisp.cons.org/~haible/packages-libiconv.html -- though my
   inspiration came from the Russian Apache, and I only heard about
   libiconv recently). Every character set can be defined as a list
   of <hex code> <unicode equiv> pairs, and a translation between any
   two SBCSs can be collapsed into a single 256-character table
   (a sketch follows this list). Efficiently building these tables
   once only, and finding them fast, is an optimization task.

3) This last layer adds buffering to the byte stream of the lower
   layers. Because chunking and translation have already been dealt
   with, it only needs to implement efficient buffering. Code
   complexity is reduced to simple stdio-like buffering.
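
To make the layer-2 idea a bit more concrete, here is a minimal sketch
(all names and types below are my own, not existing Apache code) of how
two single-byte charset definitions, each given as <hex code>
<unicode equiv> pairs, could be collapsed into one 256-entry table and
then applied per byte:

    #include <stddef.h>

    typedef unsigned short ap_unicode_t;     /* BMP code point */

    /* reverse lookup: find the byte in charset B that maps to 'u' */
    static unsigned char from_uni_b(const ap_unicode_t to_uni_b[256],
                                    ap_unicode_t u)
    {
        int c;
        for (c = 0; c < 256; ++c)
            if (to_uni_b[c] == u)
                return (unsigned char) c;
        return '?';                           /* no equivalent: substitute */
    }

    /* collapse "A -> Unicode" and "B -> Unicode" into one A -> B table */
    void ap_build_sbcs_table(const ap_unicode_t to_uni_a[256],
                             const ap_unicode_t to_uni_b[256],
                             unsigned char a_to_b[256])
    {
        int c;
        for (c = 0; c < 256; ++c)
            a_to_b[c] = from_uni_b(to_uni_b, to_uni_a[c]);
    }

    /* converting a buffer is then a simple per-byte lookup */
    void ap_translate_sbcs(unsigned char *buf, size_t len,
                           const unsigned char a_to_b[256])
    {
        size_t i;
        for (i = 0; i < len; ++i)
            buf[i] = a_to_b[buf[i]];
    }

Building such a table is O(256*256) with the naive reverse lookup
above; doing that only once per charset pair and finding the cached
table quickly is the optimization task mentioned above.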


Creating a BUFF stream involves creating the basic (layer 0) BUFF
and then pushing zero or more filters (in the right order) on top of
it. Typically this adds the chunking layer, optionally adds the
conversion layer, and usually adds the buffering layer (look for
ap_bcreate() in the code: it almost always uses B_RD/B_WR).

Here's code from a conceptual prototype I wrote:
    BUFF *buf = ap_bcreate(NULL, B_RDWR), *chunked, *buffered;
    /* push the chunking filter onto the raw (layer 0) BUFF ... */
    chunked   = ap_bpush_filter(buf,     chunked_filter, 0);
    /* ... and stdio-like buffering on top of the chunking layer */
    buffered  = ap_bpush_filter(chunked, buffered_filter, B_RDWR);
    ap_bputs("Data for buffered ap_bputs\n", buffered);


Using a BUFF stream doesn't change: simply invoke the well-known API
and call ap_bputs() or ap_bwrite() as you would today. Only now, these
would be wrapper macros

    #define ap_bputs(data, buf)             buf->bf_puts(data, buf)
    #define ap_bwrite(buf, data, max, lenp) buf->bf_write(buf, data, max, lenp)

where a BUFF struct would hold function pointers and flags for the
various levels' input/output functions, in addition to today's BUFF
layout.
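
A minimal sketch of what such a struct might look like (the field names
and signatures below are my guesses, chosen to match the macros above;
today's BUFF members are only hinted at):

    typedef struct buff_struct BUFF;

    struct buff_struct {
        /* per-layer entry points, filled in by ap_bpush_filter() */
        int (*bf_write)(BUFF *fb, const void *data, int max, int *lenp);
        int (*bf_read) (BUFF *fb, void *data, int max);
        int (*bf_puts) (const char *str, BUFF *fb);
        int (*bf_flush)(BUFF *fb);

        BUFF *bf_next;        /* next lower layer, NULL at layer 0 */
        int   bf_flags;       /* B_RD/B_WR, chunking state, ... */
        /* ... plus today's BUFF members (buffers, iol pointer, ...) */
    };

ap_bpush_filter() would then allocate a new BUFF, point bf_next at the
layer below, and install that filter's functions in the bf_* slots.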

For performance improvement, the following can be added to taste:

* less buffering (zero-copy where possible) by putting the buffers
  for buffered reading/writing down as far as possible (for SBCS: from
  layer 3 to layer 0). By doing this, the buffer can also hold a
  chunking prefix (used by layer 1) in front of the buffered data,
  to reduce the number of vectors in a writev, or the number of copies
  between buffers (see the sketch after this list). Each layer could
  indicate whether it needs a private buffer or not.

* intra-module calls can be hardcoded to call the appropriate lower
  layer directly, instead of using the ap_bwrite() etc. macros. That
  means we don't use the function pointers all the time, but instead
  call the lower levels directly. OTOH, we have the iol_* stuff, which
  uses function pointers anyway. We decided in 1.3 that we wanted to
  avoid the C++-style stuff (esp. function pointers) for performance
  reasons. But it sure would reduce the code complexity a lot.
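
For the first point, a rough illustration of the intended buffer
layout (my own sketch, not code from the prototype):

    /* One buffer shared by the layers: layer 3 fills the data area,
     * layer 1 later writes the chunk-size line into the prefix area
     * (and the chunk's trailing CRLF after the data), so layer 0 can
     * emit the whole chunk with one write instead of three vectors:
     *
     *   [ "1A3\r\n" | ....... data (0x1A3 bytes) ....... | "\r\n" ]
     */
    #define CHUNK_PREFIX_LEN 10        /* room for "FFFFFFFF\r\n" */

    struct shared_obuf {
        unsigned char prefix[CHUNK_PREFIX_LEN];
        unsigned char data[8192 + 2];  /* cf. DEFAULT_BUFSIZE, + CRLF */
        int           len;             /* bytes of payload in data[] */
    };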

The resulting layering would look like this:

    | Caller: using ap_bputs() | or ap_bgets/ap_bwrite etc.
    +--------------------------+
    | Layer 3: Buffered I/O    | gets/puts/getchar functionality
    +--------------------------+
    | Layer 2: Code Conversion | (optional conversions)
    +--------------------------+
    | Layer 1: Chunking Layer  | Adding chunks on writes
    +--------------------------+
    | Layer 0: Binary Output   | bwrite/bwritev, error handling
    +--------------------------+
    | iol_* functionality      | basic i/o
    +--------------------------+
    | apr_* functionality      |
    ....

-- 
<Martin.Kraemer@MchP.Siemens.De>             |    Fujitsu Siemens
Fon: +49-89-636-46021, FAX: +49-89-636-41143 | 81730  Munich,  Germany
