hc-dev mailing list archives

From "Sung-Gu" <jeri...@apache.org>
Subject Re: [VOTE] Re: 2.0 release - URI charset transformation HOW-TO
Date Thu, 26 Jun 2003 15:32:10 GMT
Adrian,

I changed the subject line as shown above.  ;)
Please see my comments below at steps 1-0 and 1-1.

I hope this will be helpful in the future,

Sung-Gu

----- Original Message ----- 
From: "Sung-Gu" <jericho@apache.org>
To: "Commons HttpClient Project" <commons-httpclient-dev@jakarta.apache.org>
Sent: Thursday, June 26, 2003 10:39 PM
Subject: Re: [VOTE] Re: 2.0 release - deprecate some methods?


> 
> ----- Original Message ----- 
> From: "Adrian Sutton" <adrian@intencha.com>
> 
> > If you don't know why the code would be useful or what it was
> > implemented based upon, why is it that you still want it in HttpClient?
> >   There is nothing that uses those methods anywhere in HttpClient  and
> > the presence of an FTP RFC that requires them still wouldn't make them
> > applicable to HttpClient since we aren't dealing with FTP.
> 
> It's not confined to FTP.   It applies to every Internet 'application layer'
> program.
> 
> 
> > String temporary = URIUtil.toUsingCharset(input, "UTF-8", "Big5");
> > String result = URIUtil.toUsingCharset(temporary, "Big5", "UTF-8");
> > assertEquals(input, result);
> >
> >   * \u4E01 is a Chinese character.  You can substitute \uCBBF for a wide
> > range of Chinese characters and the test will still fail.
> >
> >   * Big5 is a very commonly used charset for Chinese characters.
> 
> [reminder]
> The first step in the process can be performed by maintaining a mapping
> table that includes the local character set code and the corresponding UCS
> code.

If you're eager to use it for Big5 and EBCDIC (I regard them as legacy charsets),
I think I can give you a hint to try.    Please bear with me...

Step 1-0)

As basic preparation for the first step, your operating system should have
Unicode language support installed for your local character set.

For example, Big5 and something like Big5.UTF-16 or ch.UTF-16?
Or EBCDIC and something like EBCDIC.UTF-16?  I don't know...
Perhaps you could use the URI.getDefaultDocumentCharsetByLocale
or URI.getDefaultDocumentCharsetByPlatform methods.  I'm not sure though.
If you're using Windows 2000 or XP, I guess you can find a code page for
Unicode.  Don't confuse it with the code page for the legacy language.

(I don't really know about that.... --a   Anyway, you need a mapping table for Big5
or EBCDIC...  I imagine EBCDIC is a very old legacy code used only by IBM?)

(As for Java?  Well, I don't expect there to be any ISO-8859-45 or ISO-8859-99
for Chinese or EBCDIC.)

> The next step is to convert the UCS character code to the UTF-8 encoding.
> 
> Hmmm.... I don't know about Big5 though...
> As far as I can tell, Big5 is not a UCS.   It has to be Unicode for the second step.

Step 1-1) Please see the previous comment.
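
To make the second step (UCS character code to UTF-8) concrete, here is a minimal Java sketch of the standard UTF-8 bit layout. The class and method names are hypothetical and are not part of HttpClient. The first step (local charset to UCS) is assumed to have already produced a code point, for example by decoding bytes with the JRE's own mapping table for that charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only; not part of HttpClient.
final class UcsToUtf8 {

    /** Encodes one Unicode code point as UTF-8 using the standard bit layout. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
        }
        if (cp < 0x80) {        // 1 byte:  0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {       // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp < 0x10000) {     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        int cp = 0x4E01; // the character from the test quoted earlier in this thread
        byte[] manual = encode(cp);
        // Cross-check against the JDK's own UTF-8 encoder.
        byte[] library = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, library)); // prints "true"
    }
}
```

In practice the JDK's encoder does this work already; the manual version is only meant to show what the step involves.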

> If you want to find a UCS for Big5 automatically, you should perhaps insert
> some code into the toUsingCharset method.

Step 1-?) (does not belong to a step)

> Some might work without the UCS transformation, though I guess it is
> required.

skip...

> > If you read the JavaDoc for the String constructor being used
> > (String(byte[], String)), it says:
> > "Constructs a new String by decoding the specified array of bytes using
> > the specified charset."
> > Note the use of the word "decoding" which means that instead of
> > creating a String backed by the given byte array, it uses the specified
> > charset to convert the bytes into actual characters - conceptually
> > these characters have no particular encoding since they are
> > (conceptually) the actual characters rather than a byte representation
> > of the characters.  In reality, the characters are represented in
> > memory by a series of bytes in UTF-8 encoding as required by the JVM
> > specification.
> 
> UTF-8 is a transformation format, not really a display charset.
> I guess it's not always what the String class uses in Java.
> 
> > Secondly, the toUsingCharset method cannot work in most situations
> > because it converts the string to bytes using one encoding and then
> > converts those bytes to a String using a different encoding.  To
> > highlight why this cannot work, create a text file and save it to disk
> > using ASCII encoding.  Then, attempt to read the file back in as EBCDIC
> > encoding (or any double-byte character charset like UTF-16), the text
> 
> EBCDIC is also not a UCS.
> 
> > will have become corrupted because the bytes were mapped to characters
> > using the wrong charset (a charset is simply a mapping between bytes
> > and characters).
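
The failure described in the quoted paragraph can be reproduced with charsets that every JRE ships. This is a sketch under assumptions: `reinterpret` is a hypothetical stand-in that follows the description above (string to bytes in one charset, bytes back to a String in another), and US-ASCII takes the place of Big5 so that the example does not depend on optional charsets being installed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the behaviour described above; not HttpClient code.
final class MismatchDemo {

    /** Encodes text with one charset, then decodes the same bytes with another. */
    static String reinterpret(String text, Charset from, Charset to) {
        return new String(text.getBytes(from), to);
    }

    public static void main(String[] args) {
        String input = "\u00E9"; // one non-ASCII character

        // UTF-8 bytes C3 A9 are not valid US-ASCII, so decoding replaces them.
        String temporary = reinterpret(input, StandardCharsets.UTF_8, StandardCharsets.US_ASCII);
        // The replacement characters cannot be encoded as US-ASCII either.
        String result = reinterpret(temporary, StandardCharsets.US_ASCII, StandardCharsets.UTF_8);

        System.out.println(input.equals(result)); // prints "false"

        // Pure ASCII text survives, because both charsets map it to the same bytes.
        String ascii = "abc";
        String back = reinterpret(
                reinterpret(ascii, StandardCharsets.UTF_8, StandardCharsets.US_ASCII),
                StandardCharsets.US_ASCII, StandardCharsets.UTF_8);
        System.out.println(ascii.equals(back)); // prints "true"
    }
}
```

The round trip only succeeds for characters whose byte sequences happen to be identical in both charsets.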
> >
> > So, the possible ways for toUsingCharset to fulfill its contract are
> > for it to be changed to:
> >
> > public String toUsingCharset(String target, String fromCharset,
> >         String toCharset) {
> >     return target;
> > }
> >
> > OR to:
> >
> > public byte[] toUsingCharset(String target, String toCharset)
> >         throws java.io.UnsupportedEncodingException {
> >     return target.getBytes(toCharset);
> > }
> >
> > OR to:
> >
> > public byte[] toUsingCharset(byte[] target, String fromCharset,
> >         String toCharset) throws java.io.UnsupportedEncodingException {
> >     return new String(target, fromCharset).getBytes(toCharset);
> > }
> >
> > The last one is the only one that makes any sense at all, but I fail to
> > see how it is useful in HttpClient.
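
For reference, the byte-to-byte variant can be exercised as follows. This is a minimal sketch, not a proposal for the HttpClient API: the class name is hypothetical, and it takes `Charset` objects instead of charset names (an assumption made here to avoid the checked exception).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of the byte[]-to-byte[] variant; not HttpClient code.
final class ByteTranscoder {

    /** Decodes bytes using the source charset, then re-encodes them in the target charset. */
    static byte[] transcode(byte[] source, Charset from, Charset to) {
        return new String(source, from).getBytes(to);
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00E9".getBytes(StandardCharsets.UTF_8); // C3 A9

        // The same character is the single byte E9 in ISO-8859-1.
        byte[] latin1 = transcode(utf8, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(latin1.length); // prints "1"

        // Converting back gives the original UTF-8 bytes.
        byte[] again = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(utf8, again)); // prints "true"
    }
}
```

The conversion is only reversible when the target charset can represent every character in the input; anything it cannot represent is replaced during encoding.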
> 
> Well... it should be a byte transformation,
> from the source charset to the target charset.
> 
> Your first two examples look like a one-way ticket to me.
> Perhaps they might work?
> The last one is similar, though... I'm not sure...
> 
> > So Sung-Gu, please provide some justification for your -1 in terms of
> > why the methods should remain in HttpClient - in particular where in
> > HttpClient the method would be used and for what purpose.
> 
> As I mentioned previously...  for example, a new method, perhaps called
> 'toAnotherDisplay', could use the toUsingCharset method to change what is
> displayed when you change the encoding directly in your web browser...
> 
> 
> > Regards,
> >
> > Adrian Sutton.
> 
> Hope this is helpful,
> 
> Sung-Gu
> 