tomcat-users mailing list archives

From André Warnier
Subject Re: Migrating to tomcat 6 gives formatted currency amounts problem
Date Fri, 12 Sep 2008 20:03:29 GMT
Caldarale, Charles R wrote:
>> From: Christopher Schultz []
>> Subject: Re: Migrating to tomcat 6 gives formatted currency
>> amounts problem
>> (My understanding is that Unicode (16-bit) is actually not
>> big enough for everything, but hey, they tried).
> Point of clarification: Unicode is NOT limited to 16 bits (not even in Java, these days).
> There are defined code points that use 32 bits, and I don't think there's a limit, if you
> use the defined extension mechanisms.  Again, browsing the Unicode web site is extremely enlightening.
Further clarification:
Unicode itself is not limited to 16 bits or to any other particular width. 
Unicode is (or aims to be) a list which assigns a number, starting from 0, 
to every distinct character known to man (the codepoint space currently 
runs up to \x10FFFF).  The particular position number given to a particular 
character in this Unicode list is known as its "Unicode codepoint".
The Unicode Consortium also tries to do this with some order, 
such as trying to keep together (with consecutive codepoints) various 
groups of characters that are logically related in some way.
For example (but probably because they had to start somewhere), the 
first 128 codepoints match the original 7-bit US-ASCII alphabet;
so for instance the "capital letter A", which has code \x41 in US-ASCII, 
happens to have Unicode codepoint \x0041 (both 65 in decimal terms).
Similarly, the first 256 codepoints (those same 128, plus the next 128) 
match the iso-8859-1 alphabet (also known as iso-latin-1); 
thus the character known as "capital letter A with umlaut" (an A with a 
double dot on top) has the codepoint \x00C4 in Unicode, and the code 
\xC4 in iso-8859-1 (both 196 in decimal).
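
These correspondences are easy to check for yourself; a minimal Java sketch 
(Java, since this is a Tomcat list; the class name is just for illustration):

```java
public class Codepoints {
    public static void main(String[] args) {
        // 'A' is \x41 in US-ASCII and codepoint \x0041 in Unicode: both 65
        System.out.println((int) 'A');       // prints 65
        // 'Ä' is \xC4 in iso-8859-1 and codepoint \x00C4 in Unicode: both 196
        System.out.println((int) '\u00C4');  // prints 196
    }
}
```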

New Unicode characters (and codepoints) are being added all the time (a 
Klingon script was even proposed, though the consortium rejected it), but 
there are also holes in the list (presumably left for whenever some 
forgotten related character shows up).

A quite different issue is encoding.

Because it would be quite impractical to specify a series of characters 
just by writing their codepoints one after the other (using whatever 
number of bits each codepoint needs), a series of clever schemes have 
been devised in order to pass Unicode strings around, while being able 
to separate them into characters and keep each character associated with 
its proper codepoint.
Such schemes are known as "Unicode encodings", with names such as 
UTF-7, UTF-8, UTF-16, UTF-32, etc.
Each one of them specifies an algorithm whereby one can take any Unicode 
character (or rather, its codepoint), and "encode" it into a series of 
bits, in such a way that at the receiving end, an opposite algorithm can 
be used to "decode" that series of bits and retrieve once again the same 
series of Unicode codepoints (or characters).
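
In Java, for instance, that encode/decode round trip looks like this (a 
small sketch using the standard `String`/`byte[]` conversions; the string 
literal uses `\u00C4` only to keep the source file encoding-independent):

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        String original = "\u00C4 and A"; // "Ä and A"
        // "encode": turn the codepoints into a series of bits (here, UTF-8 bytes)
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        // "decode": the opposite algorithm retrieves the same codepoints
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(original.equals(decoded)); // prints true
    }
}
```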

UTF-16, for example, is an encoding of Unicode which uses 16-bit units.
Codepoints up to \xFFFF fit in a single 16-bit unit; codepoints beyond 
that (outside the "Basic Multilingual Plane") are encoded as a pair of 
16-bit units called a surrogate pair, so UTF-16 can in fact represent 
all of Unicode.  (The older fixed-width UCS-2 encoding, from which 
UTF-16 grew, really was limited to the first 65,536 codepoints.)
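
Java's `char` is itself a UTF-16 code unit, so a codepoint beyond \xFFFF 
visibly occupies two of them as a surrogate pair (a small sketch; U+1D11E, 
the musical G clef, is just one convenient example of such a character):

```java
public class Surrogates {
    public static void main(String[] args) {
        // U+1D11E is outside the Basic Multilingual Plane,
        // so UTF-16 encodes it as a surrogate pair of two 16-bit units
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                         // prints 2 (code units)
        System.out.println(clef.codePointCount(0, clef.length())); // prints 1 (codepoint)
    }
}
```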

UTF-8 on the other hand is a variable-length scheme, using 1, 2, 3 or 4 
8-bit bytes to represent each Unicode codepoint (the original design even 
allowed sequences of up to 6 bytes).  So there is, in principle, plenty 
of room should the need ever arise (imagine that some aliens suddenly 
show up, and that they happen to write in 167 different languages and 
alphabets).

One frequent misconception is that in UTF-8, the first 256 "character 
encoding bit sequences" match the iso-8859-1 codepoints.
Only the first 128 characters of iso-8859-1 (which happen to match the 
128 characters of US-ASCII and the first 128 Unicode codepoints), have a 
single-byte representation in UTF-8 which happens to match their Unicode 
codepoint.  The next 128 iso-8859-1 characters (which contain the 
capital A with umlaut) require 2 bytes each in the UTF-8 encoding.
Thus for instance, the "capital letter A with umlaut" has the Unicode 
codepoint \x00C4 (196 decimal), because it is the 197th character in the 
Unicode list (the first one being \x0000).  It also happens to have the 
code \xC4 (196 decimal) in the iso-8859-1 table.
But in UTF-8, it is encoded as the two bytes \xC3\x84, which is not the 
decimal number 196 in any way.
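
Again easy to verify in Java (a minimal sketch; the bytes are masked to 
treat them as unsigned before printing in hex):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    public static void main(String[] args) {
        // "capital letter A with umlaut", Unicode codepoint \x00C4
        byte[] bytes = "\u00C4".getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF);
        }
        // prints: C3 84
    }
}
```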

All of that to say that when some people on this list say things like 
"you should always decode your URLs as if they were Unicode (or UTF-8), 
because it is the same as ASCII or iso-latin-1 anyway", they are talking 
nonsense.  The only time you can do that is when the server and all the 
clients have agreed in advance that this is how they were going to 
encode and decode URLs.
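
Java's `URLDecoder` makes the point concrete: the very same percent-encoded 
bytes decode to different characters depending on which charset you assume 
the client used (a sketch; the charset names are the standard Java ones):

```java
import java.net.URLDecoder;

public class DecodeUrl {
    public static void main(String[] args) throws Exception {
        String raw = "%C3%84"; // two bytes on the wire
        // If client and server agreed on UTF-8, this is one character: Ä
        String asUtf8 = URLDecoder.decode(raw, "UTF-8");
        // If they agreed on iso-8859-1, the same bytes are two characters
        String asLatin1 = URLDecoder.decode(raw, "ISO-8859-1");
        System.out.println(asUtf8.length());   // prints 1
        System.out.println(asLatin1.length()); // prints 2
    }
}
```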
(That we developers wish it were so, and that ultimately we may get 
there, is another matter.)

It is also nonsense to say that you should by default consider 
html pages as UTF-8 encoded.  The default character set (and encoding, 
because in that case both are the same) for html served over HTTP is 
iso-8859-1, and anything else (including UTF-8 or UTF-16) is non-default.
(see, section 6).
(So if you do output something else, you *must* say so).
(And hope that IE doesn't second-guess you).

We probably owe that to Tim Berners-Lee, and, all due respect and 
admiration for the guy notwithstanding, it may be an unfortunate 
historical accident that he was born in England and worked in 
Switzerland (both countries quite happy with iso-8859-1), rather than 
being, say, a Chinese national working in Greece, who might have 
preferred Unicode and UTF-8.  But hey, he invented it, so he got to choose.

Anyway for the time being we all have to live with it.
Even the Tomcat guys.
