httpd-docs mailing list archives

From André Malo ...@perlig.de>
Subject Re: Japanese transformation is not stable
Date Tue, 27 Jul 2004 20:18:40 GMT
* Hiroaki KAWAI <hawk@mail.brain.k.u-tokyo.ac.jp> wrote:

> > > Hmm. It still happens that different JREs (?) produce different
> > > iso-2022-jp output (i.e. any time someone builds all and diffs, he gets
> > > .ja.jis diffs).
> > 
> > Well, at least mine removes bogus escape sequences and
> > produces more desirable output but yeah, it still happens.
> 
> Over the last few months, I've encountered some bugs in the Sun JRE's
> implementation of the iso-2022-jp charset converter, but I think the
> converter will be stable soon.
> I'm working on the input XML files, and I'm not watching the generated
> html files. I feel the diffs of the htmls are less important than
> those of the xmls.
> I can just ignore the diffs of generated html files right now.
> 
> Well, I don't understand what harm the diffs do to us, so may I ask for
> the reasons?

The problem is that someone who builds the whole tree gets Japanese diffs,
and most people just cannot tell whether they did something wrong or not. (I
can, because I've glanced over the accompanying RFC ;-)
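
To illustrate why such diffs can be harmless: ISO-2022-JP (RFC 1468) switches
character sets with escape sequences, and a converter may emit redundant ones
without changing the decoded text. A minimal Python sketch, assuming CPython's
built-in iso2022_jp codec; the byte strings are hand-built for illustration
and are not taken from any actual JRE output:

```python
# Escape sequences defined for ISO-2022-JP.
ESC_JISX0208 = b"\x1b$B"  # switch to JIS X 0208 (two-byte characters)
ESC_ASCII = b"\x1b(B"     # switch back to ASCII

NI = b"\x46\x7c"   # U+65E5 in JIS X 0208
HON = b"\x4b\x5c"  # U+672C in JIS X 0208

# One converter might emit the minimal form ...
minimal = ESC_JISX0208 + NI + HON + ESC_ASCII
# ... another might repeat the escape although the state is unchanged
# (hypothetical output, constructed by hand).
redundant = ESC_JISX0208 + NI + ESC_JISX0208 + HON + ESC_ASCII


def same_text(a: bytes, b: bytes, codec: str = "iso2022_jp") -> bool:
    """Return True if both byte strings decode to the same text."""
    return a.decode(codec) == b.decode(codec)


if __name__ == "__main__":
    print(minimal != redundant)           # bytes differ, so a diff shows up
    print(same_text(minimal, redundant))  # decoded text is identical
```

Comparing the decoded text instead of the raw bytes is one way to tell a real
change from escape-sequence noise.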

The second, though less important, reason is that I'm currently working on
restructuring the docs to create a better platform for translators, which
includes rewriting the styles and the build tools. I'm using the resulting
diffs to check whether something went wrong.

> > > I'd suggest finally switching the transformation to shift_jis, which
> > > is more stable (because there are none of these problematic escape
> > > sequences).
> > 
> > I'd rather use euc-jp than shift_jis.  For one thing,
> > shift_jis is a nightmare for auto detection since almost all
> > byte sequences can represent a valid character.  If I choose
> > from the three major character encoding schemes in Japan, I
> > always choose euc-jp.  It doesn't have the quirks sjis has.  The
> > fact that the current one uses iso-2022-jp is just for legacy
> > reasons.
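
One commonly cited shift_jis quirk, sketched in Python under the assumption
that CPython's shift_jis and euc_jp codecs are used: the second byte of a
two-byte shift_jis character can fall in the ASCII range, whereas euc-jp keeps
all bytes of such a character at 0x80 or above. The helper name
bytes_in_ascii_range is made up for this illustration.

```python
def bytes_in_ascii_range(text: str, codec: str) -> list:
    """Return the encoded bytes of `text` that fall in the ASCII range."""
    return [b for b in text.encode(codec) if b < 0x80]


katakana_so = "\u30bd"  # KATAKANA LETTER SO

if __name__ == "__main__":
    # In shift_jis the trail byte is 0x5C, the ASCII backslash.
    print(bytes_in_ascii_range(katakana_so, "shift_jis"))
    # In euc_jp no byte of this character is in the ASCII range.
    print(bytes_in_ascii_range(katakana_so, "euc_jp"))
```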
> 
> IMHO, whatever charset we choose, we will more or less face this kind of
> problem.
> # I, myself prefer UTF8. :-)
> ## Because it supports a wide range of characters.

UTF-8 is cool, but too large for the resulting html pages. A two-byte encoding
is way smaller, and the wider range of characters one needs (if any) is
supported by html itself (&#xxx;).
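
A rough way to check the size argument, sketched in Python with CPython's
built-in codecs. The sample string and the helper names (encoded_sizes,
encode_with_refs) are invented for this illustration; real pages also contain
ASCII markup, which takes one byte per character in all of these encodings.

```python
CODECS = ("utf-8", "euc_jp", "shift_jis", "iso2022_jp")


def encoded_sizes(text: str) -> dict:
    """Map each codec name to the encoded length of `text` in bytes."""
    return {codec: len(text.encode(codec)) for codec in CODECS}


def encode_with_refs(text: str, codec: str = "euc_jp") -> bytes:
    """Encode `text`, writing unencodable characters as &#NNN; references."""
    return text.encode(codec, errors="xmlcharrefreplace")


sample = "日本語の文書"  # six characters, all in JIS X 0208

if __name__ == "__main__":
    for codec, size in encoded_sizes(sample).items():
        print(codec, size)
    # A Hangul syllable is outside euc_jp, so it becomes a character reference.
    print(encode_with_refs("\ud55c"))
```

Kanji and kana take three bytes each in UTF-8 and two in euc-jp or shift_jis;
iso-2022-jp adds the escape sequences on top of the two-byte characters.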

> But shift_jis is actually a worse choice because there are well-known
> issues around the Shift_JIS and CP932 charsets.
> The alias definition changed repeatedly between Java releases.
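
The Shift_JIS/CP932 divergence can be sketched outside Java as well. The
following uses Python's shift_jis and cp932 codecs, which are an assumption of
this example and say nothing about how any particular Java release resolves
its aliases; the helper names are hypothetical.

```python
WAVE_BYTES = b"\x81\x60"  # the same two bytes, read under both charsets


def decode_both(data: bytes) -> tuple:
    """Decode `data` as shift_jis and as cp932 and return both results."""
    return data.decode("shift_jis"), data.decode("cp932")


def encodable(text: str, codec: str) -> bool:
    """Return True if `text` can be encoded with `codec`."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    sjis, cp932 = decode_both(WAVE_BYTES)
    # Identical bytes map to two different code points.
    print(hex(ord(sjis)), hex(ord(cp932)))
    # CIRCLED DIGIT ONE exists in cp932 but not in plain shift_jis.
    print(encodable("\u2460", "shift_jis"), encodable("\u2460", "cp932"))
```

If a tool silently treats one name as an alias for the other, the same input
can therefore yield different output.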

Ok. That's the reason I asked. I had shift_jis in mind, since we're
currently recoding to shift_jis for the CHM files, because the html help
compiler seems to support only this charset for Japanese. If euc-jp is better
for the online pages, we should use it.

If no one objects, I'm going to start the conversion to euc-jp within a few
days.

Just to be clear: this doesn't affect the *source* encoding. Keep it as
you like.

nd
-- 
Flhacs wird im Usenet grundsätzlich alsfhc geschrieben. Schreibt man
lafhsc nicht slfach, so ist das schlichtweg hclafs. Hingegen darf man
rihctig ruhig rhitcgi schreiben, weil eine shcalfe Schreibweise bei
irhictg nicht als shflac angesehen wird.       -- Hajo Pflüger in dnq
