cocoon-dev mailing list archives

From "J.Pietschmann" <>
Subject Re: [d'oh!] java APIs are not powerful enough to handle the XML spec!!
Date Thu, 13 Nov 2003 20:08:28 GMT
Stefano Mazzocchi wrote:
> The day somebody asks you why java needs to be replaced, one answer will 
> be 'it only supports 16-bits chars'. laughable as it might seem, it's true.
> yes, people, a Unicode char is not 16 bit (as I always though!) but 32!!

This is a misconception.
Unicode is an odd mixture: at the same time it defines codepoints for
representing characters and "surrogate characters" for encoding
non-baseplane characters (whose codepoints don't fit into 16 bits).
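
As a rough illustration of how a surrogate pair encodes a codepoint beyond
the baseplane, the arithmetic can be sketched in Java. The class and method
names are made up for this example, and U+1D49C (a mathematical script
capital A) is just a sample codepoint.

```java
// Hypothetical helper, for illustration only: splits a non-baseplane
// codepoint into the two 16-bit surrogate values used by UTF-16.
class SurrogateMath {
    // High (leading) surrogate: 0xD800 plus the top 10 bits of the offset.
    static char highSurrogate(int codePoint) {
        int offset = codePoint - 0x10000;          // leaves a 20-bit value
        return (char) (0xD800 + (offset >> 10));
    }

    // Low (trailing) surrogate: 0xDC00 plus the bottom 10 bits of the offset.
    static char lowSurrogate(int codePoint) {
        int offset = codePoint - 0x10000;
        return (char) (0xDC00 + (offset & 0x3FF));
    }

    // Reverses the split and recovers the original codepoint.
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int cp = 0x1D49C;
        char hi = highSurrogate(cp);
        char lo = lowSurrogate(cp);
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) hi, (int) lo);
        System.out.printf("back: U+%X%n", combine(hi, lo));
    }
}
```

For U+1D49C this yields the pair D835 DC9C, and combining the two values
gives back the original codepoint.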

ISO 10646 originally intended to use the full 32 bits, for 2^32 characters.
Because of slow progress and complaints about "wasting space", the
Unicode consortium was formed, which made quick progress on specifying
a 16-bit character set. The surrogate characters were built in in case
more than 2^16 characters came along, and to give people plenty of
room to experiment themselves in the "private areas" there. Meanwhile,
ISO-10646 and Unicode converged: ISO limited the charset to 0x110000
characters, which should be enough for everyone, and Unicode dropped
the "16 bit charset" notion; they now just define codepoints.
Unfortunately for them, they can't undo the surrogate character mess
and other wicked problems they would now like to get rid of (singletons,
certain compatibility characters, some presentation forms, ligatures).

A Java "char" variable can't hold non-baseplane Unicode characters, but
Java strings can. For Sun JVMs, they are basically UTF-16 encoded
Unicode strings. BTW there are JVMs out there which use UTF-8 in
Java Strings, the same way strings are stored in class files.
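
To make the char versus String distinction concrete, here is a small
sketch. Note that it relies on codepoint-aware methods (Character.toChars,
String.codePointCount, String.codePointAt) which only arrived with Java 5,
so they were not available when this message was written. The class name
is invented for the example.

```java
// Illustration only: a String can carry a non-baseplane codepoint
// even though no single char can.
class NonBmpString {
    // Builds a String holding exactly one codepoint (one or two chars).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1D49C);
        // Number of 16-bit char units in the String.
        System.out.println(s.length());
        // Number of actual Unicode codepoints.
        System.out.println(s.codePointCount(0, s.length()));
        // The first char on its own is only half of the character.
        System.out.println(Character.isHighSurrogate(s.charAt(0)));
        // Reading by codepoint recovers the full value.
        System.out.println(Integer.toHexString(s.codePointAt(0)));
    }
}
```

The String reports two chars but only one codepoint, which is exactly the
gap between what a "char" can hold and what a String can hold.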

The point is of course: can the run time libraries handle non-baseplane
characters? The java.text.BreakIterator can, but that's no magic. I
have no idea whether for example AWT display routines can display non-
baseplane characters, mainly because I've yet to get an appropriate
font. The TTF unicode mapping tables allocate, lo and behold, 16 bits
for the character. Who's complaining about Java?

BTW Mozilla can't deal with non-baseplane characters either, to the
chagrin of the MathML folks who use them for mathematical presentation
forms. Guess what the main reason is, besides fonts: C's wchar_t is only
16 bits wide on some platforms (Windows, for one).

> now, if you thought you could take the character() SAX event and create 
> a String out of it and do something useful with is (like print it, for 
> example), forget it. The result will very likely not be the one you expect.

That's an interesting observation. I never had problems in this area.
But this may have something to do with the fact that I never went out of
the Unicode baseplane with my chars. Heck, I'

> Another reason not to use Stings at all.
Stings are bad, of course :-)
Strings are another matter. In fact, Strings should be preferred over
char arrays because they can hide the actual representation of the Unicode
strings. If you use character arrays, you have to deal with surrogate
character pairs yourself. A substring() could be implemented to deal
with non-baseplane characters correctly. Of course, Java was invented
when people thought of Unicode as a 16 bit charset, and the standardized
behaviour is that the String methods operate on the internal char array.
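
A substring that counts codepoints rather than chars might look roughly
like the sketch below. The helper name substringByCodePoints is
hypothetical, and String.offsetByCodePoints is a Java 5 addition, so this
is what such a method could look like on a later JDK and not something
the String class offered at the time.

```java
// Illustration only: substring with indices measured in codepoints.
class CodePointSubstring {
    // begin and end are codepoint indices, not char indices.
    static String substringByCodePoints(String s, int begin, int end) {
        int from = s.offsetByCodePoints(0, begin);
        int to = s.offsetByCodePoints(from, end - begin);
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        // "a", then one non-baseplane codepoint (two chars), then "b".
        String s = "a" + new String(Character.toChars(0x1D49C)) + "b";

        // The standard method works on chars and cuts the pair in half.
        String broken = s.substring(1, 2);
        System.out.println(Character.isHighSurrogate(broken.charAt(0)));

        // The codepoint-aware variant returns the whole character.
        String whole = substringByCodePoints(s, 1, 2);
        System.out.println(Integer.toHexString(whole.codePointAt(0)));
    }
}
```

The plain substring(1, 2) call returns a lone high surrogate, while the
codepoint-based variant returns both chars of the pair.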

