httpd-dev mailing list archives

From "Title, Richard" <rti...@rational.com>
Subject multibyte characters in server root directory (and other I18N issues)
Date Wed, 18 Sep 2002 20:22:10 GMT
As a bit of background, I am incorporating a somewhat customized version
of Apache 2.0.x into a software product here. Our product installer allows
the user to specify an install-to directory. On foreign-language systems
(e.g. Japanese) this could be a directory name with multibyte characters in
it.

Our I18N testing found that there are problems if our Apache is installed
into directories with certain "bad" double-byte characters.

For instance, suppose the root of the Apache tree is in a directory whose
name ends in the Japanese character whose code is 0x835c (I won't try
to enter the character here, since I'm writing this email on an
English-language OS).
In this case the httpd.conf file will contain a directive
    ServerRoot dirname
where dirname is the Japanese string ending in this character (and there
will be other directives in the config file containing this string as
well). Now, 0x5c is the ASCII code for '\', but in this case it is not a
backslash; it's the 2nd byte of a 2-byte Japanese character code. What
happens in this situation is that Apache complains at startup with a
message to the effect that ServerRoot must be a valid directory, and it
fails to start.
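
To make the failure concrete, here is a minimal sketch of a byte-oriented
scan. It is not the actual APR code; the function name
naive_last_backslash is hypothetical. Given a path whose last character
is the double-byte code 0x83 0x5c, it reports the trail byte as a
backslash.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only (not taken from APR).
 * Returns the index of the last byte equal to '\\', or -1 if none.
 * It inspects bytes, not characters, so it cannot tell a real
 * backslash from the trail byte of a double-byte character. */
int naive_last_backslash(const unsigned char *s, size_t len)
{
    int last = -1;
    size_t i;

    for (i = 0; i < len; i++) {
        if (s[i] == '\\')
            last = (int)i;
    }
    return last;
}
```

For the bytes 'C', ':', '\', 0x83, 0x5c the real separator is at index 2,
but this scan settles on index 4, the second byte of the Japanese
character.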

Digging into the source code, I find a couple of underlying problems that
account for this. The filename-parsing code in
srclib/apr/file_io/win32/filepath.c scans filenames byte-by-byte looking
for special characters like '/' or '\', but the code is not cognizant of
multibyte characters. So it can be fooled into thinking a byte is a '\'
separator when it is really the second byte of a multibyte character in
the filename. Even if I correct this, I run into problems elsewhere. For
example, the code in server/util.c which helps parse config files also
has byte-by-byte processing which is not multibyte-aware. So it can be
fooled into thinking the 2nd byte of a 2-byte character is a line
continuation character, if that second byte is '\'. There may be other
places in Apache's source code that have internationalization problems as
well; the above are just a couple I have found so far.
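
As a rough sketch of what a multibyte-aware scan could look like (an
illustration, not a proposed patch): the code below assumes the strings
are encoded in Shift-JIS, where lead bytes fall in 0x81-0x9F or
0xE0-0xFC. All three function names are hypothetical. On Windows the
lead-byte test would more likely come from the Win32 IsDBCSLeadByte()
call, so that it follows the system code page rather than hard-coding one
encoding.

```c
#include <stddef.h>

/* Assumption: Shift-JIS encoding. Lead bytes of double-byte
 * characters are 0x81-0x9F and 0xE0-0xFC. */
int sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Index of the last real path separator ('/' or '\\'), or -1.
 * Double-byte characters are skipped as a unit, so a trail byte
 * of 0x5c is never mistaken for a backslash. */
int mb_last_separator(const unsigned char *s, size_t len)
{
    int last = -1;
    size_t i = 0;

    while (i < len) {
        if (sjis_is_lead_byte(s[i]) && i + 1 < len) {
            i += 2;                 /* skip lead byte and trail byte */
            continue;
        }
        if (s[i] == '/' || s[i] == '\\')
            last = (int)i;
        i++;
    }
    return last;
}

/* Returns 1 if the line ends in a single-byte backslash (a real
 * continuation character), 0 otherwise. The line is walked from
 * the start because a trail byte cannot be recognized in isolation. */
int mb_line_continues(const unsigned char *s, size_t len)
{
    int continues = 0;
    size_t i = 0;

    while (i < len) {
        if (sjis_is_lead_byte(s[i]) && i + 1 < len) {
            continues = 0;          /* last character was double-byte */
            i += 2;
            continue;
        }
        continues = (s[i] == '\\');
        i++;
    }
    return continues;
}
```

With the bytes 'C', ':', '\', 0x83, 0x5c, mb_last_separator() returns 2
and mb_line_continues() returns 0, which is the behavior both the
filename parser and the config-file reader would need.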

What do folks think? Is this too much of an edge case to care about, or
is it worth fixing? Obviously Apache handles multibyte *content* OK. But
it does seem to have some problems dealing with multibyte directory names
and with multibyte characters in the config file. Is this a known issue?
Is this something anyone has plans to fix? If I fix it, would apache.org
be interested in taking back the above-mentioned source files with the
fixes?

Thanks,

Rich Title
Rational Software


