httpd-dev mailing list archives

From Dirk-Willem van Gulik <>
Subject Re: Multilingual Apache [Was: Re: mod_mime/3238: New directive suggestion: AddCharset (fwd)]
Date Mon, 26 Oct 1998 14:53:36 GMT

On Sun, 25 Oct 1998, Konstantin Chuguev wrote:

> Hello Dirk-Willem. I recalled I already met your name somewhere in WWW
> i18n resources. Searching through various sources, I found

> Of course, I read this earlier. I just had not remembered your name :-)
> Very useful paper. I would like to refer to it in the MultiWeb
> documentation.

Be careful there; it is rather old. We've got some revised documentation
here; see the next message.

> And I hope this paper helped me to better understand your point of view
> about Web multilingualism.
> Excuse me if I misunderstood some of your remarks below, but it seems
> like you did not read the parts of the MultiWeb documentation included in
> the Apache docs (coming with the distribution; online at
>  or, more correctly, like I
> did not write enough documentation yet :-)  Now I am doing exactly that
> (oh, writing papers is much more difficult than programming :-), but
> will try to explain some things here in the message. 

> > On Tue, 20 Oct 1998, Konstantin Chuguev wrote:
> > 
> > Actually, because of the braindead way MIME handles charsets (i.e. as part
> > of the content-type, rather than as an independent dimension or variant),
> > the way to do this in Apache, since version 0.98, is to add, either in your
> > mime.types file or with AddType, something along the lines of
> > 
> > html_latin1     text/html;charset=iso-...
> > 
> > In fact, using AddCharset would be counterproductive (believe me, I
> > tried!) unless you fix the entire content struct; i.e. break the implicit

> Could you give us an example of how AddCharset is counterproductive?

Apache takes the MIME approach to charsets; i.e. they are effectively tied
to the MIME type, even though the code has a separate field for them.
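
To make the coupling concrete: in MIME the charset travels as a parameter of the Content-Type value rather than as a header of its own. The following Python sketch is not Apache code, and the function name split_content_type is invented for illustration; it only shows the two pieces being pulled apart with the standard library's email package.

```python
from email.message import Message
from typing import Optional, Tuple


def split_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() drops the parameters; get_content_charset()
    # reads the charset parameter and returns None when it is absent.
    return msg.get_content_type(), msg.get_content_charset()


print(split_content_type("text/html; charset=koi8-r"))
print(split_content_type("text/html"))
```

The point of the sketch is only that a client sees one combined value, so anything negotiated on "the type" implicitly carries the charset with it.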

So if you have a document in two languages, each with a different charset,
your negotiation becomes horribly complex; you either have to negotiate
a variant on just one dimension, the language, and hope that the recipient has
the font, or alternatively allow only a one-to-one relation between your
charset and language (i.e. a Russian document is always stored in koi8, a
French one in latin1 and a Czech one in latin2).

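
A minimal sketch of the one-to-one option, in Python rather than Apache's own code. It assumes koi8-r, iso-8859-1 and iso-8859-2 as the concrete labels for koi8, latin1 and latin2; the names LANGUAGE_CHARSET, parse_accept_language and choose_variant are hypothetical, and the Accept-Language parsing is deliberately simplified (no wildcard, no prefix matching).

```python
# Assumed one-to-one mapping: each language is stored in exactly one charset.
LANGUAGE_CHARSET = {"ru": "koi8-r", "fr": "iso-8859-1", "cs": "iso-8859-2"}


def parse_accept_language(header):
    """Return (language, q) pairs, highest q first, header order on ties."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        lang, _, params = item.partition(";")
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs.append((lang.strip().lower(), q))
    prefs.sort(key=lambda p: p[1], reverse=True)  # stable sort
    return prefs


def choose_variant(accept_language, available=LANGUAGE_CHARSET):
    """Negotiate on language only; the charset follows from the mapping."""
    for lang, q in parse_accept_language(accept_language):
        if q > 0 and lang in available:
            return lang, f"text/html; charset={available[lang]}"
    return None


print(choose_variant("fr;q=0.5, ru"))
```

With the mapping fixed, negotiation has a single dimension. Allowing two charsets per language would mean weighing Accept-Charset against Accept-Language for every variant, which is the complexity described above.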
> Anyway, this method is not the only one. I have it in MultiWeb, but
> never use it myself. There is another method, which suits my needs
> better.  It works if the file doesn't have a charset suffix. It looks like
> this: 
> <Language ru>
> 	ServerCharset koi8-r
> </Language>
> It's true that practically all resources in the same language share the
> same charset on the server (or at least in some server's subdirectory:
> <Language> directives can have any Apache context - up to <Files ...>).
> There's no need to label the document with a charset suffix in that
> case.

Exactly; you've put it way better than I can put it; and thus... if you
just tie it to the mime type you are there.. without having to do the
above. Except.. and this is a nice idea which I like, when you want to
go as far down as per individual files. I had not realized that; and I
agree that it can be very useful.
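
The two labelling styles discussed here (a charset suffix on the file name versus a per-language default) can be sketched as follows. This is Python for illustration, not MultiWeb's or Apache's logic; the suffix tables and the function name resolve are assumptions, and only the per-language default for ru mirrors the ServerCharset example quoted above.

```python
# Hypothetical tables; a real server would read these from its configuration.
KNOWN_LANGUAGES = {"en", "ru", "fr", "cs"}
CHARSET_SUFFIXES = {"latin1": "iso-8859-1", "latin2": "iso-8859-2", "koi8": "koi8-r"}
LANGUAGE_DEFAULT_CHARSET = {"ru": "koi8-r"}  # like <Language ru> ServerCharset koi8-r


def resolve(filename):
    """Return (language, charset) derived from the file name's suffixes."""
    language = None
    charset = None
    for part in filename.split(".")[1:]:
        if part in KNOWN_LANGUAGES:
            language = part
        elif part in CHARSET_SUFFIXES:
            charset = CHARSET_SUFFIXES[part]
    # No explicit charset suffix: fall back to the language's default, if any.
    if charset is None and language is not None:
        charset = LANGUAGE_DEFAULT_CHARSET.get(language)
    return language, charset


print(resolve("home.en.latin1.html"))
print(resolve("home.ru.html"))
```

An explicit suffix wins over the default, so individual files can still deviate from the per-language rule.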

> I don't. But someone might want to do that.
> home.en.latin1.html
> >
> > Which today just work fine.

> I have avoided changing the request_rec content struct by storing the
> charset information in the r->notes table. http_protocol.c is patched a
> bit to insert that information into the Content-Type response header

Ah, so you avoid using the content_* containers. This makes sense. But it
would not make the other modules aware of it.

> line.  Another change in the http_protocol.c file is turning on the
> charset converter in case of textual content (I cannot be sure that
> content type is textual in a fixup_handler, where the converter is set
> up, because CGI scripts can set it later).  This is a dirty hack, but
> it seems to be unavoidable if I need the functionality MultiWeb has.  I

Yes I agree.
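
A rough Python sketch of the approach described in the quoted text: the charset is kept in a side table and only merged into the Content-Type value when the response header is finally written, once the real content type is known. The dictionary stands in for r->notes, and the key "charset" and both function names are hypothetical; MultiWeb's actual key and patch are not shown in this thread.

```python
from typing import Dict


def is_textual(content_type: str) -> bool:
    """True for text/* media types, ignoring any parameters."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/")


def final_content_type(content_type: str, notes: Dict[str, str]) -> str:
    """Build the Content-Type value at header-writing time."""
    charset = notes.get("charset")  # hypothetical key name
    already_labelled = "charset=" in content_type.lower()
    if charset and is_textual(content_type) and not already_labelled:
        return f"{content_type}; charset={charset}"
    return content_type


notes = {"charset": "koi8-r"}
print(final_content_type("text/html", notes))
print(final_content_type("image/gif", notes))
```

Deciding this late is what lets a CGI script change the content type after the fixup phase without ending up with a charset on a non-textual response.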

> would like to have a standard mechanism for this in Apache.  Until that
> happens (I hope :-) I try to make only minimal changes to the original
> sources. 

Well, the real solution might be in Apache 2.0, where we might just have
streamed layers to take care of just that.

> > > > The implementation may well need cleaning up, but the idea sounds like
> > > > it may possibly have value if it isn't too expensive.
> > 
> > > Just today the latest version is released: Apache-1.3.3-MultiWeb-3.2.
> > >
> > > Some details are on
> > >
> > > Unfortunately, not much documentation now, but I am working on it.
> > >
> > > Although my implementation is kind of expensive, I think it can
> > > be useful for somebody...
> > 
> > It is actually a nice piece of work; though I worry about the i18n side,
> > as it seems to have broken a server which does not have strictly
> > parallel text in it. And yes it is very expensive :-).

> If I understood it right, you are afraid about unilingual servers or
> ones having resources with different content in different languages? 

Well, about servers which _label_ what they send out, even when that label
does not quite apply; i.e. compare to the Accept header of Netscape,
which says */* as the first entry.

> I am ready to discuss the expensiveness and minimize it.
> I really wonder why there is still no publicly available charset
> conversion  API.

Actually, there is; see mod_i18n which uses the CCC-API which, to the best
of my knowledge, is public. It uses (non-normalized :-() Unicode as the
basis. You might want to look at it. It was a TERENA project. I think your
inet96 paper even pointed to it. But yes, those APIs do tend to mix
the concept of glyphs with charsets and languages, and the C3 one seems
to have never made it beyond ap45.
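
For readers unfamiliar with the "Unicode as the basis" design, here is a small Python sketch of conversion through a Unicode pivot. It is not the CCC-API, whose interface is not described in this thread; Python's built-in codecs stand in for it, and the function name convert and the choice of NFC normalization are assumptions made for illustration.

```python
import unicodedata


def convert(data: bytes, source: str, target: str, normalize: bool = True) -> bytes:
    """Convert bytes between charsets by going through Unicode text."""
    text = data.decode(source)  # source charset -> Unicode
    if normalize:
        # Compose base letter + combining mark into one code point where
        # possible, so that single-byte target charsets can represent it.
        text = unicodedata.normalize("NFC", text)
    return text.encode(target)  # Unicode -> target charset


# "e" followed by a combining acute accent, as UTF-8 input.
decomposed = "cafe\u0301".encode("utf-8")
print(convert(decomposed, "utf-8", "iso-8859-1"))
```

Without the normalization step the same call raises UnicodeEncodeError, because the combining accent has no code in iso-8859-1. That is one practical reason a non-normalized pivot is awkward.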

