hc-dev mailing list archives

From Oleg Kalnichevski <o.kalnichev...@dplanet.ch>
Subject Re: The use of UTIUtil.toUsingCharset?
Date Thu, 20 Feb 2003 15:50:26 GMT

I apologize for restarting this conversation, but I have to confess I
found myself not intelligent enough to grasp the grand design of
the URIUtil#toUsingCharset method.

Sung-Gu apparently is too proud or too busy to spend his precious time
on such trifles as writing test cases or talking to such
primitive-minded fellas as me. I have no other choice but to turn to you
for guidance.

Ok. If I understand you right, you are saying that there are charsets
that are inadequately represented in Unicode or not represented at all.
Absolutely fine with me. So, URIUtil#toUsingCharset is supposedly needed
to help preserve those characters when performing charset translations.
Do I get it right?

Please have a look at the source code, though.

public static String toUsingCharset(
  String target, String fromCharset, String toCharset)
   throws URIException {
  try {
    return new String(target.getBytes(fromCharset), toCharset);
  } catch (UnsupportedEncodingException error) {
    throw new URIException(URIException.UNSUPPORTED_ENCODING,

As far as I can interpret these statements, a Unicode string is given as
input and another Unicode string is given back as output. 

My apologies, but was not the main thesis here that certain characters
simply cannot be represented in Unicode and therefore
URIUtil#toUsingCharset was intended to address the problem?

Help! I must be really stupid, but I can't see how a direct translation
(not URLEncoding!!!!) of a Unicode string to byte array and back to a
Unicode string is supposed to help here.
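
To make the concern concrete, here is a minimal sketch of what that
one-liner does to a String. The class name CharsetRoundTrip and the sample
text are made up for illustration, and an IllegalArgumentException stands
in for URIException; only the core expression comes from the source quoted
above.

```java
import java.io.UnsupportedEncodingException;

final class CharsetRoundTrip {
    // Same core expression as the quoted method. URIException is replaced
    // by IllegalArgumentException here (an assumption made for brevity).
    static String toUsingCharset(String target, String fromCharset, String toCharset) {
        try {
            return new String(target.getBytes(fromCharset), toCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String input = "caf\u00e9"; // already a Unicode (UTF-16) String

        // Same charset both ways: the String comes back unchanged.
        String same = toUsingCharset(input, "UTF-8", "UTF-8");
        System.out.println(same.equals(input));

        // Different charsets: the UTF-8 bytes are merely re-read as
        // ISO-8859-1, so the one accented character becomes two others.
        String reread = toUsingCharset(input, "UTF-8", "ISO-8859-1");
        System.out.println(reread.length());
        System.out.println(reread.equals("caf\u00c3\u00a9"));
    }
}
```

In both cases the input and the output are ordinary Java Strings, so
nothing that Unicode cannot represent could have been present to begin
with.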

I REALLY want to understand. Please help me.


On Tue, 2003-02-04 at 22:51, Laura Werner wrote:
> Hi Sung-Gu,
> >Actually, that's very easy...
> >And not that important unless it's going to support multilingual text.
> >
> >As you see in the diagram, the byte information created from the original
> >charset should be restored.  That's all.
> >  
> >
> My understanding of what you're saying is that if someone constructs a 
> URI using escaped characters in a particular charset (e.g. Big-5), using 
> the URI(char[] escaped) constructor, then URI needs to preserve those 
> characters.  If someone asks for the URI back as an escaped string in 
> the original charset (e.g. Big-5 again), we need to give them the 
> *exact* original string; it's not good enough to transcode from the 
> escaped Big-5 string to Unicode and back to Big-5.  Is this correct?
> If this is true, I have a few comments on why this matters...
> -- First, for those who don't understand why you can't just convert 
> everything to Unicode and stop worrying, there is some sense behind 
> this.  When Unicode was invented, the far-east languages were "Unified" 
> into the Han block of Unicode.  Some characters that have distinct codes 
> in the native double-byte character sets were mapped to single Unicode 
> characters.  This meant that some native character sets wouldn't round 
> trip to Unicode and back.  It was essentially a political compromise -- 
> the Unicode folks needed to save space in the 64k base plane, so they 
> merged Han characters that meant very similar things and looked almost 
> exactly the same.  (Emphasis on "similar" and "almost".)  But in native 
> charsets that didn't need to have room for Korean and Cyrillic and all 
> the other stuff that's in Unicode, there's room to split out multiple 
> versions of these characters that are merged together.
> -- There are also a few new character sets like JIS-212 that contain 
> characters (like Japanese dental symbols, believe it or not) that 
> haven't been encoded in Unicode yet.  Presumably we'd want to keep the 
> encoded URI string around so that we can preserve this kind of character.
> (In a past life I managed the Unicode group at IBM, and I remember far 
> more of this stuff than I thought I did.)
> A few comments on URI.java and URIUtil.java
> -- I think the comments need to be greatly improved.  It's very hard to 
> figure out what many of the methods do.  In the cases where I can figure 
> out what they do, it's hard to figure out *why*. 
> -- It would be nice if the documentation explained the charset concepts: 
> What is a document charset and a protocol charset and so on.  A 
> reference to the RFC is nice, but a more concise explanation in the 
> JavaDoc would be better.
> Laura, hoping I helped answer part of the "why" here, at least
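
The round-trip point in the quoted message can be checked mechanically. The
sketch below is only an illustration: the class and method names are
hypothetical, and US-ASCII is used as a stand-in for a lossy mapping because
it is guaranteed to be present in every JVM. It makes no claim about which
Big-5 or JIS sequences actually fail to round trip.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripCheck {
    // True if decoding the bytes to a String and encoding them again
    // with the same charset reproduces the original bytes exactly.
    static boolean survivesUnicode(byte[] original, Charset charset) {
        String decoded = new String(original, charset);
        return Arrays.equals(original, decoded.getBytes(charset));
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0x63, (byte) 0xE9};

        // ISO-8859-1 maps every byte to exactly one character and back.
        System.out.println(survivesUnicode(raw, StandardCharsets.ISO_8859_1));

        // 0xE9 is not valid US-ASCII: it decodes to U+FFFD and re-encodes
        // as '?', so the original byte is lost.
        System.out.println(survivesUnicode(raw, StandardCharsets.US_ASCII));
    }
}
```

When such a check fails, the only way to hand back the exact original is to
keep the original bytes (or the escaped string) around. Once the data exists
only as a String, the information is already gone.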
