tomcat-users mailing list archives

From André Warnier ...@ice-sa.com>
Subject Re: Migrating to tomcat 6 gives formatted currency amounts problem
Date Fri, 12 Sep 2008 20:13:10 GMT
Rectification to the clarification: what I say below about UTF-16 being 
always 16-bit and limited is also nonsense.  UTF-16 is variable-length, 
and it can cover the entire Unicode character set.  It just uses a 
variable number of 16-bit words per character (one or two), as compared 
to UTF-8 which uses a variable number of 8-bit bytes.
I should have checked my sources. Shame on me.
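
To make the rectification concrete, here is a minimal Java sketch that counts how many 16-bit words UTF-16 needs for a given codepoint.  The class and method names are made up for illustration, and the three sample codepoints are my own choice.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class Utf16Units {
    // Number of 16-bit words UTF-16 needs for one Unicode codepoint.
    static int utf16Units(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        // UTF_16BE writes no byte-order mark, so bytes / 2 = 16-bit words.
        return s.getBytes(StandardCharsets.UTF_16BE).length / 2;
    }

    public static void main(String[] args) {
        System.out.println(utf16Units(0x00C4));  // capital A with umlaut: 1
        System.out.println(utf16Units(0x20AC));  // euro sign: 1
        System.out.println(utf16Units(0x1D11E)); // musical G clef, beyond 16 bits: 2
    }
}
```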

About Java's internal char type being 16-bit wide though, I have heard 
that too, and I'm also curious.
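
As far as I can tell from the standard library, a Java char is indeed a 16-bit value, and a character outside the first 65,536 codepoints is stored as two of them.  A small sketch to check this, with a made-up class name and a sample codepoint of my own choosing:

```java
// Hypothetical demo class; the names are illustrative only.
final class JavaCharWidth {
    // Number of 16-bit char values in the string (what String.length() reports).
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode codepoints in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E)); // musical G clef
        System.out.println(Character.SIZE);        // width of a char in bits: 16
        System.out.println(charCount(clef));       // 2
        System.out.println(codePointCount(clef));  // 1
    }
}
```

So String.length() counts 16-bit units, not characters, once such codepoints are involved.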

André Warnier wrote:
> Caldarale, Charles R wrote:
>>> From: Christopher Schultz [mailto:chris@christopherschultz.net]
>>> Subject: Re: Migrating to tomcat 6 gives formatted currency
>>> amounts problem
>>>
>>> (My understanding is that Unicode (16-bit) is actually not
>>> big enough for everything, but hey, they tried).
>>
>> Point of clarification: Unicode is NOT limited to 16 bits (not even in 
>> Java, these days).  There are defined code points that use 32 bits, 
>> and I don't think there's a limit, if you use the defined extension 
>> mechanisms.  Again, browsing the Unicode web site is extremely 
>> enlightening.
>>
> Further clarification :
> Unicode is not limited to 16 bits.  Unicode is (aims to be) a list 
> which attributes a number to every distinct character known to man.  
> These numbers currently run from 0 to hexadecimal 10FFFF, which leaves 
> room for over a million characters.  The particular position number 
> given to a particular character in this Unicode list is known as its 
> "Unicode codepoint".
> The Unicode group (consortium ?) also tries to do this with some order, 
> such as trying to keep together (with consecutive codepoints) various 
> groups of characters that are logically related in some way.
> For example (but probably because they had to start somewhere), the 
> first 128 codepoints match the original 7-bit US-ASCII alphabet;
> so for instance the "capital letter A", which has code \x41 in US-ASCII, 
> happens to have Unicode codepoint \x0041 (both 65 in decimal terms).
> As another example, the same first 128 codepoints, plus the next 128 
> codepoints, match the iso-8859-1 alphabet (also known as iso-latin-1); 
> thus the character known as "capital letter A with umlaut" (an A with a 
> double-dot on top) has the codepoint \x00C4 in Unicode, and the code 
> \xC4 in iso-8859-1 (both 196 in decimal).
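
The two examples above can be checked with a few lines of Java.  This is only a sketch: the class and helper names are made up, and the umlaut is written as a \u00C4 escape so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class CodePointTable {
    // The single byte a one-character string occupies in a charset, as 0..255.
    static int singleByte(String s, Charset cs) {
        byte[] b = s.getBytes(cs);
        if (b.length != 1) {
            throw new IllegalArgumentException("not a single-byte character in " + cs);
        }
        return b[0] & 0xFF;
    }

    public static void main(String[] args) {
        // Capital letter A: same number in US-ASCII and in Unicode.
        System.out.println("A".codePointAt(0));                                    // 65
        System.out.println(singleByte("A", StandardCharsets.US_ASCII));            // 65
        // Capital letter A with umlaut: same number in iso-8859-1 and in Unicode.
        System.out.println("\u00C4".codePointAt(0));                               // 196
        System.out.println(singleByte("\u00C4", StandardCharsets.ISO_8859_1));     // 196
    }
}
```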
> 
> New Unicode characters (and codepoints) are being added all the time (I 
> think there's even Klingon in there), but there are also holes in the 
> list (presumably left for whenever some forgotten related character 
> shows up).
> 
> A quite different issue is encoding.
> 
> Because it would be quite impractical to specify a series of characters 
> just by writing their codepoints one after the other (using whatever 
> number of bits each codepoint needs), a series of clever schemes have 
> been devised in order to pass Unicode strings around, while being able 
> to separate them into characters, and keep each one with its proper 
> codepoint.
> Such schemes are known as "Unicode encodings" with names such as UTF-2, 
> UTF-7, UTF-8, UTF-16, UTF-32, etc.
> Each one of them specifies an algorithm whereby one can take any Unicode 
> character (or rather, its codepoint), and "encode" it into a series of 
> bits, in such a way that at the receiving end, an opposite algorithm can 
> be used to "decode" that series of bits and retrieve once again the same 
> series of Unicode codepoints (or characters).
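
The encode/decode round trip described above can be sketched in Java as follows.  The class name, helper name and sample text are made up, and only charsets from StandardCharsets are used so the sketch does not depend on optional ones.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class RoundTrip {
    // Encode the text into bytes with the given scheme, then decode it again.
    static String roundTrip(String text, Charset cs) {
        byte[] wire = text.getBytes(cs); // what would travel between sender and receiver
        return new String(wire, cs);     // the receiver applies the opposite algorithm
    }

    public static void main(String[] args) {
        String text = "A\u00C4\u20AC"; // A, A with umlaut, euro sign
        Charset[] schemes = {
            StandardCharsets.UTF_8, StandardCharsets.UTF_16BE, StandardCharsets.UTF_16LE
        };
        for (Charset cs : schemes) {
            System.out.println(cs.name() + ": " + text.getBytes(cs).length
                    + " bytes, same text back: " + text.equals(roundTrip(text, cs)));
        }
    }
}
```

The byte sequences differ from one scheme to the next, but each one gives back the same codepoints as long as both ends use the same scheme.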
> 
> UTF-16, for example, is an encoding of Unicode which always uses 16 bits 
> for each Unicode codepoint; but it is to my knowledge incomplete, 
> because it uses a fixed number of 16 bits per character, and it can 
> thus only ever represent the first 65,536 Unicode characters. (But 
> we're not there yet, and there is still some leeway).
> 
> UTF-8 on the other hand is a variable-length scheme, using 1, 2, 3, or 
> 4 8-bit groups to represent each Unicode codepoint.  That is enough for 
> the whole current Unicode range, and the original design allowed for 
> longer sequences should the need ever arise (imagine that some aliens 
> suddenly show up, and that they happen to write in 167 different 
> languages and alphabets).
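
A quick way to see the variable length is to ask Java how many bytes it produces for a few codepoints.  This is a sketch only: the class name, helper name and sample codepoints are my own picks.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class Utf8Lengths {
    // Number of 8-bit groups UTF-8 needs for one Unicode codepoint.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x0041));  // capital A: 1
        System.out.println(utf8Length(0x00C4));  // capital A with umlaut: 2
        System.out.println(utf8Length(0x20AC));  // euro sign: 3
        System.out.println(utf8Length(0x1D11E)); // musical G clef: 4
    }
}
```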
> 
> One frequent misconception is that in UTF-8, the first 256 "character 
> encoding bit sequences" match the iso-8859-1 codepoints.
> Only the first 128 characters of iso-8859-1 (which happen to match the 
> 128 characters of US-ASCII and the first 128 Unicode codepoints), have a 
> single-byte representation in UTF-8 which happens to match their Unicode 
> codepoint.  The next 128 iso-8859-1 characters (which contain the 
> capital A with umlaut) require 2 bytes each in the UTF-8 encoding.
> Thus for instance, the "capital letter A with umlaut" has the Unicode 
> codepoint \x00C4 (196 decimal), because it is the 197th character in the 
> Unicode list (and the first one is \x0000).  It also happens to have the 
> code \xC4 (196 decimal) in the iso-8859-1 table.
> But in UTF-8, it is encoded as the two bytes \xC3\x84, which is not the 
> decimal number 196 in any way.
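
The byte values quoted above are easy to verify.  Here is a minimal Java sketch that dumps the bytes of the "capital letter A with umlaut" under both encodings; the class name and the hex helper are made up for the purpose.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class UmlautBytes {
    // Render a byte array as upper-case hexadecimal, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00C4";
        System.out.println(hex(aUmlaut.getBytes(StandardCharsets.ISO_8859_1))); // C4
        System.out.println(hex(aUmlaut.getBytes(StandardCharsets.UTF_8)));      // C384
    }
}
```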
> 
> 
> All of that to say that when some people on this list say things like 
> "you should always decode your URLs as if they were Unicode (or UTF-8), 
> because it is the same as ASCII or iso-latin-1 anyway", they are talking 
> nonsense.  The only time you can do that is when the server and all the 
> clients have agreed in advance that this is how they were going to 
> encode and decode URLs.
> (That we developers wish it were so, and that ultimately we may get 
> there, is another matter.)
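
To illustrate why the agreement matters, here is a sketch that decodes the same percent-escaped URL fragment under both assumptions, using java.net.URLDecoder.  The class name, the wrapper method and the sample fragment are made up; the fragment stands for a capital A with umlaut sent by a client that encodes in UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; the names are illustrative only.
final class UrlDecodeDemo {
    // Thin wrapper that turns the checked exception into an unchecked one.
    static String decode(String escaped, String charsetName) {
        try {
            return URLDecoder.decode(escaped, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String sentAsUtf8 = "%C3%84"; // what a UTF-8 client sends for A with umlaut
        // Server agrees with the client: one character comes out.
        System.out.println(decode(sentAsUtf8, "UTF-8").length());      // 1
        // Server assumes iso-8859-1 instead: two wrong characters come out.
        System.out.println(decode(sentAsUtf8, "ISO-8859-1").length()); // 2
    }
}
```

Both calls succeed without any error, which is exactly why a mismatch tends to go unnoticed until somebody looks at the data.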
> 
> It is also talking nonsense to say that you should by default consider 
> html pages as UTF-8 encoded.  The default character set (and encoding, 
> because in that case both are the same) for html is iso-8859-1, and 
> anything else (including UTF-8 or UTF-16) is non-default.
> (see http://www.ietf.org/rfc/rfc2854.txt, section 6).
> (So if you do output something else, you *must* say so).
> (And hope that IE doesn't second-guess you).
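
What happens when the declaration is missing can be sketched like this: the page is written out as UTF-8, and the reader falls back on iso-8859-1.  The class and method names are made up, and the Content-Type value is just the usual header form built from the charset's name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class CharsetMismatch {
    // What a reader sees when UTF-8 bytes are interpreted as iso-8859-1.
    static String misread(String text) {
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // A Content-Type value that says which encoding the page really uses.
    static String htmlContentType() {
        return "text/html; charset=" + StandardCharsets.UTF_8.name();
    }

    public static void main(String[] args) {
        // One character sent, two wrong characters displayed.
        System.out.println(misread("\u00C4").length()); // 2
        System.out.println(htmlContentType());          // text/html; charset=UTF-8
    }
}
```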
> 
> We probably owe that to Tim Berners-Lee, and tons of respect and 
> admiration for the guy notwithstanding, it may be an unfortunate 
> historical accident that he was born in England and worked in 
> Switzerland (both countries quite happy with iso-8859-1), rather than 
> being, say, a Chinese national working in Greece, who might have 
> preferred Unicode and UTF-8.  But hey, he invented it, so he got to choose.
> 
> Anyway for the time being we all have to live with it.
> Even the Tomcat guys.
> 
> 
> ---------------------------------------------------------------------
> To start a new topic, e-mail: users@tomcat.apache.org
> To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
> For additional commands, e-mail: users-help@tomcat.apache.org
> 

