cocoon-dev mailing list archives

From Stefano Mazzocchi
Subject Re: [d'oh!] java APIs are not powerful enough to handle the XML spec!!
Date Fri, 14 Nov 2003 14:58:05 GMT

On 13 Nov 2003, at 21:08, J.Pietschmann wrote:

> Stefano Mazzocchi wrote:
>> The day somebody asks you why java needs to be replaced, one answer 
>> will be 'it only supports 16-bits chars'. laughable as it might seem, 
>> it's true.
>> yes, people, a Unicode char is not 16 bit (as I always thought!) but 
>> 32!!
> This is a misconception.

yeah, well, I'm not talking about the encoding, but about the fact that 
you can't fit all unicode chars in 16 bits address space, that was my 

UTF-32 uses 32 bit flat encoding. UTF-16 and UTF-8 use a different type 
of encoding (which is the same I used in the SAX compiler that cocoon 

> Unicode is an odd mixture: at the same time it defines codepoints for
> representing characters and "surrogate characters" for encoding
> non-baseplane characters (whose codepoints don't fit into 16 bit).
> ISO 10646 originally intended to use full 32bit for 2^32 characters.
> Because of slow progress and complaints about "wasting space", the
> Unicode consortium was formed which made quick progress on specifying
> a 16-bit character set. The surrogate characters were built in in case
> more than 2^16 characters came along, and for giving people plenty of
> room to experiment themselves in the "private areas" there. Meanwhile,
> ISO-10646 and Unicode converged: ISO limited the charset to 0x110000
> characters, which should be enough for everyone, and Unicode dropped
> the "16 bit charset" notation, they just define codepoints.
> Unfortunately for them, they can't undo the surrogate character mess
> and other wicked problems they now like to get rid of (singletons,
> certain compatibility characters, some presentation forms, ligatures).

I'm very ignorant on these things, I must admit! thanks for sharing.
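
For reference, the surrogate mechanism described above is plain arithmetic. The following is a minimal sketch; the class and method names are made up for illustration. A code point above 0xFFFF has 0x10000 subtracted, and the remaining 20 bits are split into two 10-bit halves that are placed in the 0xD800 and 0xDC00 ranges. 1024 x 1024 pairs on top of the first 0x10000 code points gives exactly the 0x110000 limit mentioned in the quote.

```java
final class SurrogateMath {
    // Split a non-baseplane code point (0x10000..0x10FFFF) into a UTF-16 surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a non-baseplane code point");
        }
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    // Inverse operation: rebuild the code point from a surrogate pair.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int cp = 0x1D538; // MATHEMATICAL DOUBLE-STRUCK CAPITAL A, a typical MathML character
        char[] pair = toSurrogates(cp);
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) pair[0], (int) pair[1]);
        System.out.printf("back: U+%X%n", fromSurrogates(pair[0], pair[1]));
    }
}
```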

> A Java "char" variable can't hold non-baseplane Unicode characters, but
> Java strings can. For Sun JVMs, they are basically UTF-16 encoded
> Unicode strings. BTW there are JVMs out there which use UTF-8 in
> Java Strings, the same way strings are stored in class files.
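
A small sketch of the char-versus-String distinction, assuming a newer JDK than the ones under discussion here: the code point methods used below (Character.toChars, Character.charCount, String.codePointAt) only arrived with J2SE 5.0. The class name and the helper fromCodePoint are invented for the example.

```java
final class CharVsString {
    // Build a String from a single code point; non-baseplane ones become two chars.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int cp = 0x1D538; // does not fit into a 16-bit char
        String s = fromCodePoint(cp);

        System.out.println(Character.charCount(cp));               // 2 chars are needed
        System.out.println(s.length());                            // 2, counted in UTF-16 code units
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
        System.out.println(s.codePointAt(0) == cp);                 // true, the String holds the character
    }
}
```

No single char in the String is the character itself, but the String as a whole carries it without loss.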

> The point is of course: can the run time libraries handle non-baseplane
> characters?

It's even worse! Is javac able to handle UTF-16 encoded files? If so, 
would it be able to do:

  String nonBaseplaneString = "... some non-baseplane chars here ";

and what would be the use of this, if I can't guarantee that


will yield true all the time?
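
The condition asked about above is missing from the archived message, so the sketch below does not try to reconstruct it. It only shows, under the assumption of a Java 7 or later JDK, what can be checked for a literal written with \u escapes (which javac accepts whatever the source file encoding is): the count of chars, the count of code points, and a lossless round trip through UTF-8. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class LiteralCheck {
    // 'x', U+1D538 written as a surrogate pair, 'y'
    static final String NON_BASEPLANE = "x\uD835\uDD38y";

    static int codeUnits(String s) {
        return s.length(); // UTF-16 code units, not characters
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Encode to UTF-8 and decode again; a well-formed string must survive unchanged.
    static boolean survivesUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(NON_BASEPLANE));    // 4
        System.out.println(codePoints(NON_BASEPLANE));   // 3
        System.out.println(survivesUtf8(NON_BASEPLANE)); // true
    }
}
```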

> The java.text.BreakIterator can, but that's no magic. I
> have no idea whether for example AWT display routines can display non-
> baseplane characters, mainly because I've yet to get an appropriate
> font. The TTF unicode mapping tables allocate, lo and behold, 16 bits
> for the character. Who's complaining about Java?
> BTW Mozilla can't deal with non-baseplane characters either, to the
> chagrin of the MathML folks who use them for mathematical presentation
> forms. Guess what's the main reason, beside fonts: C's wchar_t is
> 16 bit too.

well, to be honest, I thought as well that moving from 8 bits to 16 
bits for address space would have solved all our issues with chars once 
and for all... so I don't feel like blaming them for not having thought 
of more complex issues :-/

>> now, if you thought you could take the character() SAX event and 
>> create a String out of it and do something useful with it (like print 
>> it, for example), forget it. The result will very likely not be the 
>> one you expect.
> That's an interesting observation. I never had problems in this area.
> But this may have something to do with the fact that I never went out
> of the Unicode baseplane with my chars.

Yeah, nobody ever did (this came out after testing Slide for webdav 
compliance)... but I have the feeling this will bite us in the back in 
the future.

>> Another reason not to use Stings at all.
> Stings are bad, of course :-)


> Strings are another matter. In fact, Strings should be preferred over
> char arrays because they can hide the actual representation of the
> Unicode strings.

Very true! Missed that.

> If you use character arrays, you have to deal with surrogate
> character pairs yourself. A substring() could be implemented to deal
> with non-baseplane characters correctly. Of course, Java was invented
> when people thought of Unicode as 16 bit charset, and the standardized
> behaviour is that the String methods operate on the internal char
> array.

talking about a mess :-(
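
A sketch of what such a substring() could look like, assuming the String.offsetByCodePoints method from J2SE 5.0 and later. The helper name substringByCodePoints is made up; it is not part of the JDK. Indexes are counted in code points, so a surrogate pair is never cut in half.

```java
final class CodePointSubstring {
    // begin and end are code point indexes, not char indexes.
    static String substringByCodePoints(String s, int begin, int end) {
        int from = s.offsetByCodePoints(0, begin);        // char index of the begin-th code point
        int to = s.offsetByCodePoints(from, end - begin); // advance by the requested code points
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        String s = "a\uD835\uDD38b"; // 'a', U+1D538, 'b'

        // The standard method works on the char array and cuts the pair in half.
        String broken = s.substring(1, 2);
        System.out.println(Character.isHighSurrogate(broken.charAt(0))); // true, a lone surrogate

        // The code point aware variant returns the whole character.
        String whole = substringByCodePoints(s, 1, 2);
        System.out.println(whole.length());                 // 2
        System.out.println(whole.codePointAt(0) == 0x1D538); // true
    }
}
```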

