tomcat-users mailing list archives

From "Robert Graf-Waczenski" <>
Subject RE: Character Encoding -ISo-8859-1 Vs UTF-8 Vs GBK
Date Tue, 18 Oct 2005 10:35:28 GMT

I think the original poster needs some help with Java and encodings,
so I am taking the liberty of adding some (simplified) background here.
(Sorry, this is outside the scope of Tomcat and is pure Java.)

A Java char is internally a 16-bit UTF-16 code unit, and a String
is a sequence of them. Because Unicode covers practically every
script in use, any of its characters can be held in a Java String
(the rarer ones outside the Basic Multilingual Plane take two chars).
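
As a small illustration of the char vs. character distinction, here is
a minimal sketch. The class and method names (CharUnits, codeUnits,
codePoints) are made up for this example; only String.length() and
String.codePointCount() are real JDK calls.

```java
class CharUnits {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (what a reader would call characters).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String han = "\u4e2d";         // U+4E2D, a Chinese character inside the BMP
        String clef = "\uD834\uDD1E";  // U+1D11E, a musical symbol outside the BMP

        System.out.println(codeUnits(han) + " " + codePoints(han));   // prints "1 1"
        System.out.println(codeUnits(clef) + " " + codePoints(clef)); // prints "2 1"
    }
}
```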

The challenge here is that the original file was not written
using UTF-8 but probably a Chinese encoding, which
means that the actual binary data in the file is different from
the binary data that you would have had if the file had been
encoded in UTF-8.

In order to create the correct Java characters from a file that was
written in some other encoding, Java simply must know the encoding
that the file was originally written with. There is no
other way. When the file is read, what you basically get in
the first place is a byte[]. Java comes with several reader
classes (InputStreamReader, for example) that do the decoding for
you once you name the encoding, but none of
them is capable of performing "encoding guessing".
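
A minimal sketch of reading a file with an explicitly named encoding
follows. The class name ReadWithCharset and the helper readAll are
made up for this example; the file path and the charset name are
whatever applies in your case. Note that "GBK" and "GB2312" ship with
the usual JDK distributions, but a stripped-down runtime may lack them.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class ReadWithCharset {
    // Reads the whole file, decoding its bytes with the named charset.
    static String readAll(String path, String charsetName) throws IOException {
        // Throws an unchecked exception if the charset name is unknown.
        Charset cs = Charset.forName(charsetName);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), cs))) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ReadWithCharset <file> <charset>");
            return;
        }
        System.out.println(readAll(args[0], args[1]));
    }
}
```

For example, "java ReadWithCharset.java data.txt GBK" would decode a
(hypothetical) data.txt as GBK. Passing the wrong name does not fail,
it just produces the wrong characters.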

What effectively happens when you read the file as UTF-8 is:

byte[] bytes = .... // your raw bytes here
String s = new String(bytes, "UTF-8"); // garbage due to wrong encoding

The problem in the original poster's case is that the byte[]
above contains the bytes as they were written originally, so
in order to reconstruct the original characters, you need
to name that original encoding when decoding:

// no garbage if the file was encoded with GB2312
String s = new String(bytes, "GB2312");
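
To make this concrete, here is a small self-contained sketch that
decodes the same bytes twice. The class name DecodeDemo and the helper
decode are made up for this example, and the byte[] is produced in
memory as a stand-in for the poster's file content. It assumes the GBK
charset is available in your JDK.

```java
import java.nio.charset.Charset;

class DecodeDemo {
    // Decodes raw bytes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters

        // Stand-in for the raw bytes of a file that was saved as GBK.
        byte[] bytes = original.getBytes(Charset.forName("GBK"));

        // Same encoding as the writer used: the characters come back.
        System.out.println(decode(bytes, "GBK").equals(original));   // prints "true"

        // Wrong encoding: the bytes are not valid UTF-8, so you get garbage.
        System.out.println(decode(bytes, "UTF-8").equals(original)); // prints "false"
    }
}
```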

IIRC, GB2312 is a superset of ASCII (not of ISO-8859-1), and GBK in
turn extends GB2312. For anything beyond ASCII, the bytes differ
from UTF-8, which is why you get garbage.
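
If you want to see the byte-level difference yourself, a sketch like
the following prints the encoded bytes in hex. The class name
ByteCompare and the helper hex are made up for this example; it again
assumes GB2312 is available in your JDK.

```java
import java.nio.charset.Charset;

class ByteCompare {
    // Encodes s with the named charset and returns the bytes as hex, e.g. "D6 D0".
    static String hex(String s, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Plain ASCII is the same single byte in all three encodings.
        System.out.println(hex("A", "GB2312"));          // prints "41"
        System.out.println(hex("A", "ISO-8859-1"));      // prints "41"
        System.out.println(hex("A", "UTF-8"));           // prints "41"

        // A Chinese character: two bytes in GB2312, three different bytes in UTF-8.
        System.out.println(hex("\u4e2d", "GB2312"));     // prints "D6 D0"
        System.out.println(hex("\u4e2d", "UTF-8"));      // prints "E4 B8 AD"

        // The German sharp s: one byte in ISO-8859-1, two bytes in UTF-8.
        System.out.println(hex("\u00df", "ISO-8859-1")); // prints "DF"
        System.out.println(hex("\u00df", "UTF-8"));      // prints "C3 9F"
    }
}
```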



> -----Original Message-----
> From: David Delbecq []
> Sent: Tuesday, October 18, 2005 12:08 PM
> To: Tomcat Users List
> Subject: Re: Character Encoding -ISo-8859-1 Vs UTF-8 Vs GBK
> Hi,
> UTF-8 can handle European and Chinese characters very well.
> If you can't read either of them using UTF-8, this simply
> means your text file is not saved in UTF-8.
> wrote:
> >Hi,
> >I am trying to read universal characters from a text file into my Java
> >application, which stores them in a database. When I use encoding type
> >"GBK" I can read all the special characters in Chinese. When I use
> >encoding "ISO-8859-1" I can read Latin but not Chinese. But when I use
> >encoding "UTF-8", I think I am supposed to be able to read both Chinese
> >and Latin correctly, yet I am not able to read either of them. Can
> >anyone give me pointers to a solution?
> >Furthermore, the "beta" is converted to "ss" in Latin-1.
> >
> >thanks in advance
> >Birendar S Waldiya
> >
> >
