hc-dev mailing list archives

From "Sung-Gu" <jeri...@apache.org>
Subject Re: The use of URIUtil.toUsingCharset?
Date Wed, 05 Feb 2003 02:39:11 GMT

----- Original Message -----
From: "Laura Werner" <laura@lwerner.org>


> Hi Sung-Gu,
>
> >Actually, that's very easy...
> >And it's not that important unless multilingual support is needed.
> >
> >As you see in the diagram, the byte information created from the
> >original charset should be restored.  That's all.
> >
> >
> My understanding of what you're saying is that if someone constructs a
> URI using escaped characters in a particular charset (e.g. Big-5), using
> the URI(char[] escaped) constructor, then URI needs to preserve those
> characters.  If someone asks for the URI back as an escaped string in
> the original charset (e.g. Big-5 again), we need to give them the
> *exact* original string; it's not good enough to transcode from the
> escaped Big-5 string to Unicode and back to Big-5.  Is this correct?
>
> If this is true, I have a few comments on why this matters...
>
> -- First, for those who don't understand why you can't just convert
> everything to Unicode and stop worrying, there is some sense behind
> this.  When Unicode was invented, the far-east languages were "Unified"
> into the Han block of Unicode.  Some characters that have distinct codes
> in the native double-byte character sets were mapped to single Unicode
> characters.  This meant that some native character sets wouldn't round
> trip to Unicode and back.  It was essentially a political compromise --
> the Unicode folks needed to save space in the 64k base plane, so they
> merged Han characters that meant very similar things and looked almost
> exactly the same.  (Emphasis on "similar" and "almost".)  But in native
> charsets that didn't need to have room for Korean and Cyrillic and all
> the other stuff that's in Unicode, there's room to split out multiple
> versions of these characters that are merged together.
>
> -- There are also a few new character sets like JIS-212 that contain
> characters (like Japanese dental symbols, believe it or not) that
> haven't been encoded in Unicode yet.  Presumably we'd want to keep the
> encoded URI string around so that we can preserve this kind of character.
>
> (In a past life I managed the Unicode group at IBM, and I remember far
> more of this stuff than I thought I did.)

Excellent explanation!
It is also described at a URL that I pointed to on this mailing list
before.
I think yours is much nicer! ;)
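
To make the round-trip point concrete, here is a minimal sketch in Java. It is not code from URI.java or URIUtil.java: the class and method names are hypothetical. It also uses malformed UTF-8 bytes as a stand-in for the Big-5 case, on the assumption that a Big-5 charset may not be installed everywhere. The cause of the loss is different (invalid bytes rather than Han unification), but the effect is the same: unescaping to Unicode and escaping again does not always give back the original escaped string, so the original has to be kept.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpClient's URI or URIUtil classes.
final class EscapedRoundTrip {

    // Percent-decode an escaped string into the raw bytes it stands for.
    // Assumes every '%' that has two characters after it is followed by two hex digits.
    static byte[] unescape(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '%' && i + 2 < escaped.length()) {
                out.write(Integer.parseInt(escaped.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write((byte) c);
            }
        }
        return out.toByteArray();
    }

    // Percent-encode raw bytes, leaving a small set of safe ASCII characters as-is.
    static String escape(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xFF;
            boolean safe = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~' || v == '/';
            if (safe) {
                sb.append((char) v);
            } else {
                sb.append('%').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    // The lossy path: escaped string -> bytes -> Unicode String -> bytes -> escaped string.
    static String viaUnicode(String escaped, Charset cs) {
        String text = new String(unescape(escaped), cs);
        return escape(text.getBytes(cs));
    }

    public static void main(String[] args) {
        String wellFormed = "/%E4%B8%AD"; // valid UTF-8 bytes for one Han character
        String malformed = "/a%C0%AF";    // bytes that are not valid UTF-8
        System.out.println(viaUnicode(wellFormed, StandardCharsets.UTF_8).equals(wellFormed));
        System.out.println(viaUnicode(malformed, StandardCharsets.UTF_8).equals(malformed));
    }
}
```

The first string survives the trip through Unicode and the second does not, because the decoder substitutes replacement characters for bytes it cannot map. Keeping the original escaped characters (or their bytes) alongside any decoded form avoids depending on the round trip at all.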

> A few comments on URI.java and URIUtil.java
>
> -- I think the comments need to be greatly improved.  It's very hard to


Comments alone are not enough... I think...
Some explanation of this is already written in the URI class for readers
to notice...    and some work is still left to do... as you say...

> figure out what many of the methods do.  In the cases where I can figure
> out what they do, it's hard to figure out *why*.

>
> -- It would be nice if the documentation explained the charset concepts:
> What is a document charset and a protocol charset and so on.  A
> reference to the RFC is nice, but a more concise explanation in the
> JavaDoc would be better.

Actually, my problem is that I just know how it works, I guess.
It's hard for me to put myself in the place of someone who hasn't experienced that...
I think I will have a chance to improve it sometime later...

> Laura, hoping I helped answer part of the "why" here, at least

Thank you very much, Laura! ;)

Sung-Gu
