directory-dev mailing list archives

From Emmanuel Lecharny <elecha...@gmail.com>
Subject Re: UTF-8 woes
Date Fri, 29 Dec 2006 19:37:07 GMT
Ersin Er wrote:

> On 12/29/06, Emmanuel Lecharny <elecharny@gmail.com> wrote:
>
>> AAAAAAHHHHHhhhh ! (Or is it \C3\C3\C3\C3\C3\C3HHHHHhhhh ? :)
>>
>> You will have to be a little bit more explicit... How do you build your RDN?
>> FYI, it is supposed to be a UTF-8 encoded String, so if you are to code an
>> ä, you will have to:
>> - create a byte array containing its counterpart (0xC3 0xa4) and do a new
>> String( byteArray, "UTF-8" ) before passing it to the RDN constructor
>> - OR do a new RDN( "\u00e4" );
>>
>> never do a new RDN( "ä" ), because then the String will be considered as an
>> ISO-8859-1 encoded string (at least in Germany or in France, not in Turkey
>> :)
>
>
> What is the difference between creating an RDN with new RDN( "ä" ) and
> with new String( new byte[] { 0xC3, 0xa4 }, "UTF-8" ) ? 

There is a _big_ difference, because your java file might have been 
saved using an ISO-8859-1 encoding. With new RDN( "ä" ), the "ä" is 
stored in the file using whatever encoding your editor uses (usually 
the default encoding of your computer), and the compiler then decodes 
it using the encoding it assumes for the source file. There is no 
guarantee at all that it will be correct when the file is compiled on 
another computer using a different encoding, and the string is then 
transformed to UTF-8 bytes. Using 
new String( new byte[] { (byte)0xC3, (byte)0xA4 }, "UTF-8" ) tells the 
JVM that the bytes are UTF-8 encoded (and UTF-8 = Unicode encoded as a 
sequence of bytes), so it can translate them correctly to a UTF-16 
String. Of course, using \u00e4 should be the preferred way if you are 
to use internal Strings like "This is an umlaut : \u00e4" in your java file.
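
To make the comparison concrete, here is a minimal sketch showing that the 
two safe forms give the same String whatever the encoding of the source 
file. The class and method names are made up for the example, and 
StandardCharsets assumes Java 7 or later (with an older JDK, pass the 
"UTF-8" name instead).

```java
import java.nio.charset.StandardCharsets;

final class UmlautLiteral {
    // UTF-8 encoding of U+00E4. The casts are required because
    // 0xC3 and 0xA4 do not fit in a signed Java byte.
    static final byte[] UTF8_A_UMLAUT = { (byte) 0xC3, (byte) 0xA4 };

    // Decode the bytes explicitly as UTF-8.
    static String fromUtf8Bytes() {
        return new String(UTF8_A_UMLAUT, StandardCharsets.UTF_8);
    }

    // A Unicode escape is pure ASCII in the source file.
    static String fromUnicodeEscape() {
        return "\u00e4";
    }

    public static void main(String[] args) {
        System.out.println(fromUtf8Bytes().equals(fromUnicodeEscape())); // true
        System.out.println(fromUtf8Bytes().length());                    // 1
    }
}
```

Both forms keep the source file pure ASCII, which is why they do not depend 
on the editor or on the compiler settings.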

> There is
> nothing as "UTF-8" String in Java. 

When you write new String( <some bytes>, "UTF-8" ), you just tell the 
JVM that the byte array is supposed to be a UTF-8 encoded String. It 
will translate those bytes to UTF-16 chars, using one or two chars per 
character as needed (Unicode code points go up to U+10FFFF, which does 
not fit in a single 16-bit char). For instance, the é in my name has a 
value of 0xE9 in Unicode, and is encoded 0xC3, 0xA9 in UTF-8. If you 
don't tell String() that the byte array is UTF-8 encoded, then it will 
just consider that the byte array is using the default platform 
encoding. And if it's ISO-8859-1, 0xC3 = 'Ã', and 0xA9 = '©', so you 
now have a Java String which is 2 chars long instead of one char long...
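
A small sketch of that effect: the same two bytes are decoded once as UTF-8 
and once as ISO-8859-1. The charset is passed explicitly here instead of 
relying on the platform default, so the result is the same on every 
machine. The class name is made up for the example, and StandardCharsets 
assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
    // UTF-8 encoding of U+00E9 (e with an acute accent).
    static final byte[] E_ACUTE_UTF8 = { (byte) 0xC3, (byte) 0xA9 };

    // Decode the same bytes with the given charset.
    static String decode(Charset charset) {
        return new String(E_ACUTE_UTF8, charset);
    }

    public static void main(String[] args) {
        String right = decode(StandardCharsets.UTF_8);
        String wrong = decode(StandardCharsets.ISO_8859_1);

        System.out.println(right.length());                // 1
        System.out.println(wrong.length());                // 2
        System.out.println(wrong.equals("\u00c3\u00a9"));  // true
    }
}
```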

> All strings are UTF-16. You can get
> their representations in other encodings as byte arrays. So when you
> do a new RDN( "ä" ), it should be converted to UTF-16 internally. What
> am I missing here?

It is transformed to UTF-16 according to the encoding used on your 
platform, which is what the compiler assumes for the source file unless 
told otherwise. But then, if that encoding (say ISO-8859-1) is not the 
one the file was really saved with, when doing a 
String.getBytes( "UTF-8" ), you might get something very different from 
what you were expecting.
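
One way this can go wrong, as a sketch only: it simulates, at runtime, a 
file saved as UTF-8 by the editor and then read as ISO-8859-1 by the 
compiler. That scenario is an assumption made for the illustration (the 
opposite mismatch is just as possible), and the class and method names are 
made up.

```java
import java.nio.charset.StandardCharsets;

final class SourceEncodingMismatch {
    // Simulate an editor saving U+00E4 as UTF-8 (bytes C3 A4) and a
    // compiler decoding those bytes as ISO-8859-1.
    static String misreadLiteral() {
        byte[] onDisk = "\u00e4".getBytes(StandardCharsets.UTF_8);
        return new String(onDisk, StandardCharsets.ISO_8859_1);
    }

    // Render a byte array as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // What you wanted to send on the wire.
        System.out.println(hex("\u00e4".getBytes(StandardCharsets.UTF_8)));        // C3 A4
        // What you actually get after the mismatch.
        System.out.println(hex(misreadLiteral().getBytes(StandardCharsets.UTF_8))); // C3 83 C2 A4
    }
}
```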

Ok, this is not simple. A simple rule then:
*always use \uxxxx when encoding non-ASCII characters in a java file*

>
> (Not being able to display the character in source code in other
> platforms is a different matter. It's about the text editor encoding.)

Yes, but you always use an editor to write your java file...

At this point, I may also be missing something, but I would then like to 
have more information, such as a test case which exposes the problem.

Emmanuel.
