httpd-dev mailing list archives

From Jeff Trawick <trawi...@bellsouth.net>
Subject Re: BUFF, IOL, Chunking, and Unicode in 2.0 (long)
Date Sun, 07 May 2000 17:24:22 GMT
I can't pretend to give useful responses to your questions, but since I
am eager to learn more about these very issues and to figure out which
parts of my current work are incompatible with the big picture, I'll
chime in anyway in the hope that someone can pinpoint particularly
blatant areas of cluelessness.

> On Fri, 5 May 2000, Jeff Trawick wrote:
> 
> > >    Using (loadable?) translation tables based on unicode definitions
> > >    is a very similar approach to what libiconv offers you (see
> > >    http://clisp.cons.org/~haible/packages-libiconv.html -- though my
> > >    inspiration came from the russian apache, and I only heard about
> > >    libiconv recently). Every character set can be defined as a list
> > >    of <hex code> <unicode equiv> pairs, and translations between
> > >    several SBCS's can be collapsed into a single 256 char table.
> > >    Efficiently building them once only, and finding them fast is an
> > >    optimization task.
> 
> I've actually got a chunk of (perl) code which generates the C code to do
> such. I am now waiting for the Unicode 3.0 standard to see how up to date
> that code is; but will most certainly want to advance that. It also does
> UTF8 conversion and 'closest' approximation.
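
To make the "single 256 char table" idea in the quoted text concrete, here is a minimal sketch. It assumes each single-byte charset is described by a 256-entry byte-to-Unicode table. The names sbcs_collapse and sbcs_xlate are hypothetical; this is not the Perl generator mentioned above, nor libiconv's or Russian Apache's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: build a byte-to-byte table mapping charset A to
 * charset B, given each charset's 256-entry byte-to-Unicode table.
 * Bytes of A with no equivalent in B map to `fallback`. */
static void sbcs_collapse(const uint32_t a_to_uni[256],
                          const uint32_t b_to_uni[256],
                          unsigned char fallback,
                          unsigned char out[256])
{
    for (size_t i = 0; i < 256; i++) {
        out[i] = fallback;
        for (size_t j = 0; j < 256; j++) {
            if (b_to_uni[j] == a_to_uni[i]) {
                out[i] = (unsigned char)j;
                break;
            }
        }
    }
}

/* Translate a buffer in place using a collapsed table. */
static void sbcs_xlate(const unsigned char table[256],
                       unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = table[buf[i]];
}
```

The quadratic build step runs once per charset pair; after that, each byte costs a single table lookup, which is why caching the collapsed tables is the part worth optimizing.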

This is an APR topic, right?  Does anybody think these details should
live outside of ap_xlate_*()?  (I'm not knocking it.  I just want to
understand if somebody thinks we shouldn't be using the APR
translation interfaces in buff.c.  Personally I hope we can separate
discussions of the low-level code to perform translations from
discussions of how Apache decides when/what to translate between and
how it manages the buffers passed to the APR routines.)  For the
platform for which I'm most interested in translation, iconv() is
fine, but I'm happy to 

a) work on new/changed interfaces to the APR xlate stuff which enable
   somebody else to plug in alternate translation mechanisms

b) help separate iconv()-dependent code in APR from generic support

I don't know about iconv() support on other platforms, but on OS/390 a
lot of character set translation is based on iconv() so functional
inadequacies in iconv() support tend to show up in a lot of places and
they end up getting addressed in iconv().  I would imagine that the
same conversions that are needed for serving data via a web server are
needed for ftp, telnet, and manual manipulation (e.g., with
iconv(1)).
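
For readers who have not used it, a minimal sketch of a one-shot conversion with the POSIX iconv() interface follows. The wrapper name convert_buffer is hypothetical, and the charset names that iconv_open() accepts vary by platform, so treat this as an illustration rather than the way APR wraps iconv().

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper: convert `inlen` bytes from charset `from` to
 * charset `to`, writing at most `outcap` bytes into `out`.
 * Returns the number of bytes written, or (size_t)-1 on error. */
static size_t convert_buffer(const char *from, const char *to,
                             const char *in, size_t inlen,
                             char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);   /* note: target charset comes first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;              /* iconv() takes char ** for input */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush shift state */
    iconv_close(cd);

    return rc == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}
```

A server would keep the handle open across writes and deal with partial input (EINVAL) and full output buffers (E2BIG) rather than treating every failure as fatal. On some platforms iconv lives in a separate library and needs -liconv at link time.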

> One thing not mentioned in the API is how this third layer knows enough
> about the data to do such conversion. At the least, if the input were
> UTF8 or unicode, it should know the destination charset, language and
> possibly mode of speech. In reality it might need the input
> charset+language and the destination charset+language.
> 
> Dw.

I guess you're talking about a combination of charset/language
negotiation?  I think that is the most interesting question, and I
look forward to Martin's response :)  I have looked through the
Russian Apache docs and also I think I have a basic feel for some of
the current problems that affect EBCDIC (particularly acute since none
of the EBCDIC charsets are supported by browsers).  I think that
Russian Apache has some nice configuration mechanisms but that perhaps
they provide a large superset of what is needed/what is a
straightforward solution.  

some comments on Russian Apache configuration directives:

CharsetDecl

  helps with negotiation when we have Accept-language but not
  Accept-charset? 

  basis for allowing aliases for the charset names? (i.e., required if
  you want CharsetAlias?)

CharsetAlias

  needed since there is no standard for character set names

CharsetRecodeTable, CharsetWideRecodeTable

  this type of info should be hidden in APR; I would hope that RA
  would be able to put their table support in APR as an implementation
  of ap_xlate_*()

CharsetSelectionOrder

  To me, this deals with some of the unnecessary superset of
  configuration primitives.  I don't think we need special controls
  for portnumber, hostname, or dirprefix, so I don't think this is
  needed.  

CharsetPriority

  fine, needed for handling negotiation as far as I know

CharsetDefault

  "This is the charset that will be provided to the client if all
   other ways of charset determination fail to work."

  In the absence of some declaration that says, for example, "by
  default translate IBM-xxxx to ISO-8859-7" then this seems to be very
  important.  But it seems clearer to directly say "if the page is in
  charset A, deliver it in charset B" or better yet "it makes sense to
  convert a page in charset A to charset B, charset C, or charset D"

CharsetByPort

  I don't understand why this is needed.  Doesn't virtual server plus
  some other required character set configuration provide the desired
  function? 

CharsetAgent

  ??

CharsetStrictURIMatch

  not necessary

CharsetSourceEnc

  this seems o.k., but I'm not sure this is part of a minimal
  solution.  More reading on my part is required :)

CharsetByExtension

  AddCharset is the official way to do this now, right?

everything else

  ??

For EBCDIC-on-OS/390 (and hopefully a slightly wider audience :) ), I
think that the default encoding in the absence of configuration is the
character set associated with the current locale.  I would want to set
up some global variables at initialization; these would be handles to
translate headers (based on how the code is compiled) and a handle to
translate content to ASCII (based on the current locale).
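
As a rough sketch of that initialization step, assuming a POSIX system with iconv(): the locale's charset name can be obtained with nl_langinfo(CODESET) and used to open a handle once at startup. The function names are hypothetical, and whether iconv_open() accepts the name that nl_langinfo() returns depends on the platform.

```c
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: name of the character set of the current locale. */
static const char *locale_charset_name(void)
{
    setlocale(LC_ALL, "");        /* adopt the locale from the environment */
    return nl_langinfo(CODESET);
}

/* Hypothetical init-time setup: open one handle that converts content
 * from the locale's charset to `target`. Returns (iconv_t)-1 if
 * iconv_open() does not accept this pair of names on the platform. */
static iconv_t open_content_handle(const char *target)
{
    return iconv_open(target, locale_charset_name());
}
```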

I don't have a strong opinion for what makes up the minimal set of
enhancements which would allow the most problems to be solved.  

If we simply have an AddCharset coded to tell us that a file is stored
in a certain charset, we still don't know what set of character sets
we should be willing to translate it into, right?
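
One way to picture the missing piece is a per-source list of acceptable targets, along the lines of "it makes sense to convert a page in charset A to charset B, C, or D". The sketch below is purely illustrative: the struct, the function name, and the placeholder charset names "A" through "D" are all hypothetical and do not correspond to any existing directive.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical rule: a stored charset and the charsets it may be sent as. */
struct charset_rule {
    const char *source;       /* charset the file is stored in */
    const char *targets[4];   /* acceptable delivery charsets, NULL-terminated */
};

/* Placeholder names only; real configuration would supply these. */
static const struct charset_rule rules[] = {
    { "A", { "B", "C", "D", NULL } },
};

/* Returns 1 if converting `source` to `target` is allowed, else 0.
 * Uses exact string comparison; real charset names would need
 * case-insensitive matching and alias handling. */
static int can_convert(const char *source, const char *target)
{
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        if (strcmp(rules[i].source, source) != 0)
            continue;
        for (size_t j = 0; j < 4 && rules[i].targets[j] != NULL; j++) {
            if (strcmp(rules[i].targets[j], target) == 0)
                return 1;
        }
    }
    return 0;
}
```

Negotiation would then choose among the allowed targets using Accept-Charset and whatever priority the configuration provides.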

--------------------------------------------------------------------

Now once it is decided how we learn what translation (if any) to
perform, what are some lower-level details?

request_rec needs to carry around some information about character set
translation.  

#ifdef APACHE_XLATE
struct rr_xlate {
    /* whatever is needed, such as: */
    ap_xlate_t *rxlate;  /* translation handle (or NULL) for the current
                            object being received from the client */
    ap_xlate_t *wxlate;  /* translation handle (or NULL) for the current
                            object being sent to the client */
};
#endif /* APACHE_XLATE */

struct request_rec {
    /* ... current fields ... */
#ifdef APACHE_XLATE
    struct rr_xlate *rrx;
#endif
};

The existence of such a pointer in request_rec when APACHE_XLATE is
defined is cast in stone, but it must be understood that exactly what
it points to will remain experimental for some time.

We can use a BUFF option to enable/disable translation and
store/retrieve the relevant translation handle:

  turn on BUFF translation for content:

    #ifdef APACHE_XLATE
    /* some logic to ensure the proper xlate-on-write handle is stored
       in rr_xlate */
    ap_bsetopt(r->client, BO_WXLATE, &r->rrx->wxlate);
    #endif

  turn on BUFF translation for headers:

    /* EBCDIC problem only */
    #ifdef CHARSET_EBCDIC
    ap_bsetopt(r->client, BO_WXLATE, &hdrs_to_ascii);
    #endif

  save current state, turn on translation for headers on an EBCDIC platform:

    (this stuff would be in Martin's push/pop macros)

    #ifdef CHARSET_EBCDIC
    ap_xlate_t *wxlate;
    /* save the current write-translation handle, then switch to the
       header handle */
    ap_bgetopt(r->client, BO_WXLATE, &wxlate);
    ap_bsetopt(r->client, BO_WXLATE, &hdrs_to_ascii);
    #endif

    /* write some headers */

    #ifdef CHARSET_EBCDIC
    ap_bsetopt(r->client, BO_WXLATE, &wxlate);
    #endif
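
The save/switch/restore pattern above can be exercised on its own with stand-in types. In this sketch, struct buff, struct xlate, the OPT_* constants, and the push/pop helpers are all hypothetical substitutes for BUFF, ap_xlate_t, BO_WXLATE/BO_RXLATE, and the push/pop macros. It only shows the shape of the idea, not the real buff.c interfaces.

```c
#include <stddef.h>

struct xlate { const char *name; };        /* stand-in for ap_xlate_t */
enum buff_opt { OPT_WXLATE, OPT_RXLATE };  /* stand-ins for BO_WXLATE, BO_RXLATE */

struct buff {                              /* stand-in for BUFF */
    struct xlate *wxlate;                  /* handle used on writes, or NULL */
    struct xlate *rxlate;                  /* handle used on reads, or NULL */
};

/* Store the handle that *val points to; returns 0 on success. */
static int buff_setopt(struct buff *fb, enum buff_opt opt, struct xlate **val)
{
    switch (opt) {
    case OPT_WXLATE: fb->wxlate = *val; return 0;
    case OPT_RXLATE: fb->rxlate = *val; return 0;
    }
    return -1;
}

/* Fetch the current handle into *val; returns 0 on success. */
static int buff_getopt(const struct buff *fb, enum buff_opt opt, struct xlate **val)
{
    switch (opt) {
    case OPT_WXLATE: *val = fb->wxlate; return 0;
    case OPT_RXLATE: *val = fb->rxlate; return 0;
    }
    return -1;
}

/* Install `tmp` as the write handle and return the one it replaced. */
static struct xlate *push_write_xlate(struct buff *fb, struct xlate *tmp)
{
    struct xlate *saved = NULL;
    buff_getopt(fb, OPT_WXLATE, &saved);
    buff_setopt(fb, OPT_WXLATE, &tmp);
    return saved;
}

/* Restore a handle previously returned by push_write_xlate(). */
static void pop_write_xlate(struct buff *fb, struct xlate *saved)
{
    buff_setopt(fb, OPT_WXLATE, &saved);
}
```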

Similar stuff would be used for manipulating the read translation
state in buff (i.e., whether or not we translate on a read operation
and what the translation handle is).

There would be special logic in bsetopt() for BO_WXLATE and BO_RXLATE
to manipulate the pointers to read/write operations for those
operations that have different entry points depending on whether or
not there is translation (see my post from yesterday for a terribly
simple use of some of Martin's layering ideas which he posted 5 or 6
days ago).
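
A minimal sketch of that entry-point switching, under the assumption that translation is a simple 256-entry table lookup: setting the option swaps the write function pointer, so the untranslated path pays nothing. Every name here (struct sink, write_plain, write_xlate, sink_set_wxlate) is hypothetical and much simpler than the real BUFF.

```c
#include <stddef.h>
#include <string.h>

struct sink;
typedef size_t (*write_fn)(struct sink *s, const unsigned char *p, size_t n);

struct sink {                      /* hypothetical, much-reduced BUFF */
    const unsigned char *table;    /* 256-entry translation table, or NULL */
    write_fn write;                /* entry point chosen by sink_set_wxlate() */
    unsigned char out[64];
    size_t used;
};

/* Copy bytes unchanged; returns the number of bytes accepted. */
static size_t write_plain(struct sink *s, const unsigned char *p, size_t n)
{
    if (n > sizeof s->out - s->used)
        n = sizeof s->out - s->used;
    memcpy(s->out + s->used, p, n);
    s->used += n;
    return n;
}

/* Copy bytes through the translation table. */
static size_t write_xlate(struct sink *s, const unsigned char *p, size_t n)
{
    if (n > sizeof s->out - s->used)
        n = sizeof s->out - s->used;
    for (size_t i = 0; i < n; i++)
        s->out[s->used + i] = s->table[p[i]];
    s->used += n;
    return n;
}

/* Analogue of the special logic in bsetopt(): pick the entry point once. */
static void sink_set_wxlate(struct sink *s, const unsigned char *table)
{
    s->table = table;
    s->write = table ? write_xlate : write_plain;
}
```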

If you guys would just drop by the house some afternoon, we could
pretty quickly figure out how to separate the problem into parts that
different parties could tackle :)

-- 
Jeff Trawick | trawick@ibm.net | PGP public key at web site:
     http://www.geocities.com/SiliconValley/Park/9289/
          Born in Roswell... married an alien...
