directory-dev mailing list archives

From Emmanuel Lecharny <elecha...@gmail.com>
Subject Re: UTF-8 woes
Date Fri, 29 Dec 2006 21:16:46 GMT
Ersin Er wrote:

> On 12/29/06, Emmanuel Lecharny <elecharny@gmail.com> wrote:
>
>> Ersin Er wrote:
>>
>> > On 12/29/06, Emmanuel Lecharny <elecharny@gmail.com> wrote:
>> >
>> >> AAAAAAHHHHHhhhh ! (Or is it \C3\C3\C3\C3\C3\C3HHHHHhhhh ? :)
>> >>
>> >> You will have to be a little bit more explicit... How do you build
>> >> your RDN?
>> >> FYI, it is supposed to be a UTF-8 encoded String, so if you want to
>> >> encode an ä, you will have to:
>> >> - create a byte array containing its UTF-8 encoding (0xC3 0xA4) and
>> >>   do a new String( byteArray, "UTF-8" ) before passing it to the RDN
>> >>   constructor
>> >> - OR do a new RDN( "\u00e4" );
>> >>
>> >> Never do a new RDN( "ä" ), because then the String will be considered
>> >> an ISO-8859-1 encoded string (at least in Germany or in France, not in
>> >> Turkey :)
>> >
>> >
>> > What is the difference between creating an RDN with new RDN( "ä" ) and
>> > with new String( new byte[] { (byte) 0xC3, (byte) 0xA4 }, "UTF-8" ) ?
>>
>> There is a _big_ difference, because your java file might have been
>> saved using an ISO-8859-1 encoding. With new RDN( "ä" ), the file is
>> stored using the default encoding of your computer, and inside this
>> file you have this "ä". There is no guarantee at all that it will be
>> correct when you transform the string to UTF-8 bytes on another
>> computer, using a different encoding.
>> Using new String( new byte[] { (byte) 0xC3, (byte) 0xA4 }, "UTF-8" )
>> states explicitly that the bytes are UTF-8 encoded (and UTF-8 =
>> Unicode encoded using bytes), so they are decoded correctly into the
>> String's internal UTF-16. Of course, using \u00e4 should be the
>> preferred way if you are to use internal Strings like
>> "This is an umlaut : \u00e4" in your java file.
>
>
> If your source code file contains "special characters" encoded in X
> encoding, and if you compile that code with javac using the encoding X
> (-encoding X), then there can be no problem. The so-called special
> character is safely translated to Java's internal encoding. There is no
> UTF-8 related stuff here. The X can be UTF-8 or not, that's all.

Yes, but then you will have to inform all the users about the encoding
used when you saved the java file. And trust me, people in Korea
are not using ISO-8859-1 encoding, and have no idea what an "ä" is...
The reverse is also true :)
Using -encoding X is overkill, IMHO. It is much preferable to
declare those special chars using the '\uxxxx' notation or, for
international strings, to use external property files, one per
language if needed (_FR, _DE, ... property files).
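
A minimal sketch of the two safe ways to get an "ä" into a String, assuming
a recent JDK (StandardCharsets is used here instead of the "UTF-8" charset
name). The class UmlautDemo is hypothetical, and the RDN constructor is left
out on purpose since only the String matters:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of the project discussed in this thread.
final class UmlautDemo {

    // Option 1: decode the UTF-8 bytes 0xC3 0xA4 explicitly.
    // The casts are required because 0xC3 and 0xA4 do not fit in a signed byte.
    static String fromUtf8Bytes() {
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA4 };
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Option 2: a Unicode escape, which is plain ASCII in the source file.
    static String fromEscape() {
        return "\u00e4";
    }

    public static void main(String[] args) {
        String a = fromUtf8Bytes();
        String b = fromEscape();
        System.out.println(a.equals(b));                       // true
        System.out.println(a.length());                        // 1
        System.out.println(Integer.toHexString(a.charAt(0)));  // e4

        // Encoding back to UTF-8 gives the same two bytes (printed as signed).
        byte[] encoded = b.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(encoded));          // [-61, -92]
    }
}
```

Both options give the same one-character String whatever encoding the
source file is saved or compiled with, because the file itself only contains
US-ASCII characters.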

>
> You can create your source code with ISO-8859-1, and safely compile it
> without the encoding option while your platform encoding is
> ISO-8859-1. The special characters will be converted to safe Java
> UTF-16 forms. But if you send it to me, and if my platform encoding is
> ISO-8859-9 (Turkish), and if I compile it with just javac (no encoding
> option), the strings will be malformed (but will still compile). If I
> give the option -encoding ISO-8859-1 to the compiler, there will be no
> problem. There is still no problem related to UTF-8 here.

I didn't say that there was a problem with UTF-8. UTF-8 is just a way to
encode Unicode using bytes. But, yes, you are right: given that you
_know_ that I have used ISO-8859-1 encoding to write my file, you
just have to use the -encoding ISO-8859-1 flag to compile it on your
platform. But I hope you know which encoding Trustin is using, or any
other person in the world not living in western Europe or the USA :)
A little bit cumbersome, isn't it?
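
To illustrate the mismatch, here is a rough simulation, not what javac does
internally: the text is encoded with the charset the file was "saved" in and
decoded with the charset the reader assumes. The class and method names are
hypothetical, and only charsets guaranteed by every JDK are used:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class simulating a file read with the wrong encoding.
final class WrongEncodingDemo {

    // Encode with the charset the file was saved in, decode with the one
    // the reader assumes. If they differ, the text may come back damaged.
    static String reread(String text, Charset savedAs, Charset readAs) {
        return new String(text.getBytes(savedAs), readAs);
    }

    // Space-separated hex values of the UTF-16 code units of a String.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String umlaut = "\u00e4";
        Charset utf8 = StandardCharsets.UTF_8;
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // Same charset on both sides: the character survives.
        System.out.println(hex(reread(umlaut, latin1, latin1))); // e4

        // Saved as UTF-8, read as ISO-8859-1: two wrong characters.
        System.out.println(hex(reread(umlaut, utf8, latin1)));   // c3 a4

        // Saved as ISO-8859-1, read as UTF-8: malformed input, replaced
        // by the replacement character U+FFFD.
        System.out.println(hex(reread(umlaut, latin1, utf8)));   // fffd
    }
}
```

In none of the failing cases is there an error message, which matches the
"malformed but will still compile" behaviour described above.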

Whatever, this should not be a problem for us. Again, if you have to use
special chars in your code, use the '\uxxxx' notation, for the good of
everybody else. If it's for messages, then I18n is your friend. And
whatever the encoding of your file (ISO-8859-1 or -2 or -xxx), it will
be OK, as you will only be using US-ASCII chars, so the -encoding X flag
will be useless ;)
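
For the messages case, a small sketch of an ASCII-only property file whose
special characters are written as Unicode escapes. The key name and the
German text are made up for illustration, and the file content is inlined as
a String instead of being read from a real _DE property file:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical demo class; the key and message text are invented.
final class AsciiPropertiesDemo {

    // Parses property-file content given as a String.
    static Properties load(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // The doubled backslash puts a literal backslash into the content,
        // so the "file" holds the six ASCII characters of the escape.
        String germanFile = "umlaut.message=Dies ist ein Umlaut: \\u00e4\n";

        String message = load(germanFile).getProperty("umlaut.message");

        // Properties.load has decoded the escape into a single character.
        System.out.println(message.endsWith("\u00e4"));  // true
        System.out.println(message.length());            // 22
    }
}
```

Since such a file contains only US-ASCII characters, it reads the same under
ISO-8859-1, ISO-8859-9 or UTF-8, so nobody has to be told which encoding it
was saved with.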

Oh, btw, Unicode is really a mess, did I already say that? :)

Emmanuel
