directory-dev mailing list archives

From "Ersin Er" <ersin...@gmail.com>
Subject Re: UTF-8 woes
Date Fri, 29 Dec 2006 20:08:48 GMT
On 12/29/06, Emmanuel Lecharny <elecharny@gmail.com> wrote:
> Ersin Er wrote:
>
> > On 12/29/06, Emmanuel Lecharny <elecharny@gmail.com> wrote:
> >
> >> AAAAAAHHHHHhhhh ! (Or is it \C3\C3\C3\C3\C3\C3HHHHHhhhh ? :)
> >>
> >> You will have to be a little bit more explicit... How do you build
> >> your RDN?
> >> FYI, it is supposed to be a UTF-8 encoded String, so if you want to
> >> encode an ä, you will have to either:
> >> - create a byte array containing its UTF-8 counterpart (0xC3 0xA4) and
> >>   do a new String( byteArray, "UTF-8" ) before passing it to the RDN
> >>   constructor
> >> - OR do a new RDN( "\u00e4" );
> >>
> >> Never do a new RDN( "ä" ), because then the String will be considered
> >> an ISO-8859-1 encoded string (at least in Germany or in France, not in
> >> Turkey :)
> >
> >
> > What is the difference between creating an RDN with new RDN( "ä" ) and
> > with new String( new byte[] { (byte) 0xC3, (byte) 0xA4 }, "UTF-8" ) ?
>
> There is a _big_ difference, because your java file might have been
> saved using an ISO-8859-1 encoding. With new RDN( "ä" ), the "ä" is
> stored in the file using whatever encoding your editor or computer
> defaults to. There is no guarantee at all that it will be correct when
> the file is compiled and the string transformed to UTF-8 bytes on another
> computer using a different encoding. Using
> new String( new byte[] { (byte) 0xC3, (byte) 0xA4 }, "UTF-8" ) tells the
> JVM that the bytes are UTF-8 encoded (UTF-8 being Unicode encoded as
> bytes), so it can translate them correctly to a UTF-16 String. Of course,
> using \u00e4 should be the preferred way if you are to use internal
> Strings like "This is an umlaut : \u00e4" in your java file.

If your source file contains "special characters" encoded in encoding X,
and you compile it with javac using that same encoding (-encoding X),
then there can be no problem. The so-called special character is safely
translated to Java's internal encoding. There is nothing UTF-8 related
here: X may or may not be UTF-8, that's all.

You can create your source code in ISO-8859-1 and safely compile it
without the -encoding option as long as your platform encoding is
ISO-8859-1. The special characters will be converted to their proper Java
UTF-16 forms. But if you send it to me, and my platform encoding is
ISO-8859-9 (Turkish), and I compile it with just javac (no -encoding
option), any character that differs between the two encodings will end
up wrong in the strings (the code will still compile). If I give the
compiler the option -encoding ISO-8859-1, there will be no problem.
There is still nothing related to UTF-8 here.
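
To make the scenario above concrete, here is a minimal sketch that imitates
the one step that matters: decoding the bytes of a source file with whatever
charset the compiler was told (or defaulted) to use. The class and method
names are made up for illustration and are not part of javac. I am also
assuming the JRE ships the ISO-8859-9 charset, which Java does not guarantee,
so that part is guarded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of javac: it only mimics the decoding step.
class SourceDecodingDemo {

    // Decode the raw bytes of a source file the way a compiler would,
    // given the charset named by -encoding (or the platform default).
    static String decodeSource(byte[] fileBytes, Charset compilerCharset) {
        return new String(fileBytes, compilerCharset);
    }

    public static void main(String[] args) {
        // In an ISO-8859-1 file, 0xE4 is a-umlaut and 0xFD is y-acute.
        byte[] latin1File = { (byte) 0xE4, (byte) 0xFD };

        // Same charset on both sides: both characters come out as written.
        String right = decodeSource(latin1File, StandardCharsets.ISO_8859_1);
        System.out.println(right.equals("\u00e4\u00fd"));

        // Assumption: ISO-8859-9 is available on this JRE.
        if (Charset.isSupported("ISO-8859-9")) {
            String turkish = decodeSource(latin1File, Charset.forName("ISO-8859-9"));
            // 0xE4 is identical in both charsets; 0xFD decodes to dotless i.
            System.out.println(turkish.equals("\u00e4\u0131"));
        }
    }
}
```

Nothing here ever touches UTF-8: the mismatch is purely between the encoding
the file was saved in and the one used to read it back.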

A mini reference: http://www.jorendorff.com/articles/unicode/java.html

> > There is no such thing as a "UTF-8" String in Java.
>
> When you write new String( <some bytes>, "UTF-8" ), you just tell the
> JVM that the byte array is supposed to be a UTF-8 encoded String. It
> will translate those bytes to UTF-16 chars, using one or two chars per
> character as needed (Unicode code points above U+FFFF take two chars).
> For instance, the é in my name has a value of 0xE9 in Unicode, and is
> encoded 0xC3, 0xA9 in UTF-8. If you don't tell String() that the byte
> array is UTF-8 encoded, then it will just assume that the byte array uses
> the default platform encoding. And if that is ISO-8859-1, 0xC3 = 'Ã' and
> 0xA9 = '©', so you now have a Java String which is 2 chars long instead
> of one char long...
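
The byte example quoted above can be checked directly. A small sketch (the
class name is mine and purely illustrative) that decodes the two UTF-8 bytes
of é once as UTF-8 and once as ISO-8859-1, naming the charset explicitly
instead of relying on the platform default:

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: the class name is not from any library.
class DecodeBytesDemo {

    // The UTF-8 encoding of e-acute (U+00E9) is the two bytes C3 A9.
    static final byte[] E_ACUTE_UTF8 = { (byte) 0xC3, (byte) 0xA9 };

    // Correct: the bytes are UTF-8, and we say so.
    static String asUtf8() {
        return new String(E_ACUTE_UTF8, StandardCharsets.UTF_8);
    }

    // What happens when the same bytes are taken for ISO-8859-1.
    static String asLatin1() {
        return new String(E_ACUTE_UTF8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(asUtf8().length());   // 1
        System.out.println(asLatin1().length()); // 2

        // A code point above U+FFFF needs two chars (a surrogate pair).
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());       // 2
    }
}
```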



> > All strings are UTF-16. You can get
> > their representations in other encodings as byte arrays. So when you
> > do a new RDN( "ä" ), it should be converted to UTF-16 internally. What
> > am I missing here?
>
> It is transformed to UTF-16 according to the encoding used on your
> platform. But then, if your local encoding is ISO-8859-1, when doing a
> String.getBytes( "UTF-8" ), you might get something very different from
> what you were expecting.
>
> Ok, this is not simple. A simple rule then :
> *always use \uxxxx when encoding non ASCII characters in a java file*
>
> >
> > (Not being able to display the character in source code in other
> > platforms is a different matter. It's about the text editor encoding.)
>
> yes, but you always use an editor to write your java file...
>
> At this point, I may also be missing something, but I would then like to
> have more information, such as a test case which exposes the problem.
>
> Emmanuel.
>
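
A sketch of the \uxxxx rule quoted above, assuming nothing beyond the JDK: the
literal is written with a Unicode escape, so the source file stays pure ASCII
and compiles to the same String whatever encoding javac assumes. I use a plain
String rather than the RDN class, whose API I am not reproducing here; the
class and helper names are illustrative.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: shows that an escaped literal encodes predictably.
class EscapeRoundTripDemo {

    // Written with a Unicode escape, so this source file stays pure ASCII.
    static final String A_UMLAUT = "\u00e4";

    // Render a byte array as upper-case hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One char in memory, two bytes in UTF-8, one byte in ISO-8859-1.
        System.out.println(A_UMLAUT.length());                                   // 1
        System.out.println(hex(A_UMLAUT.getBytes(StandardCharsets.UTF_8)));      // C3A4
        System.out.println(hex(A_UMLAUT.getBytes(StandardCharsets.ISO_8859_1))); // E4
    }
}
```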


-- 
Ersin