cocoon-dev mailing list archives

From Pier Fumagalli <>
Subject Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/
Date Mon, 05 Sep 2005 01:52:03 GMT
On 5 Sep 2005, at 01:53, Antonio Gallardo wrote:
> Pier Fumagalli wrote:
>> It's not a UTF-8 character, it's a UNICODE character: \u doesn't
>> mean "UTF" but rather "UNICODE" (which is not an encoding).
> First, I must apologize because I worded the previous phrase very
> badly. I wanted to say that I don't see a reason to use Java
> "Unicode escaping" in this case. Reading the Java Specification,
> we find [0]: "Programs are written using the Unicode character
> set." So IMO UNICODE 00B4, the Acute Accent in Latin-1, should
> be represented by only one code.

Programs are written using (yes) the UNICODE specification, but
source ".java" files are not. If you look at the output of
"javac -help" it will say:

   -encoding <encoding>      Specify character encoding used by source files

So, the encoding parameter (which, methinks, defaults to the
platform's default encoding) tells javac to interpret the byte stream
with the specified encoding, and then the decoded UNICODE character
stream is parsed. Much like:

InputStream javaSource = new FileInputStream(javaSourceFile);
// Decode the raw bytes using the encoding given on the command line.
Reader reader = new InputStreamReader(javaSource, encodingSpecifiedInCommandLine);

So, programs are written using UNICODE characters, yes; the source
files, though, are encoded using some sort of encoding mechanism
(UTF-8, UTF-16, Shift-JIS, blablabla).

In our case the sequence "\u00B4" has the same byte representation
(5c 75 30 30 62 34) in almost all encodings (UTF-8, US-ASCII,
ISO8859-1, Shift-JIS, ...), as it's composed of bytes in the range
from 00 to 7F (which hardly change whatever encoding you put them
in).
Java uses this syntax to represent a UNICODE character, because with  
most of the encodings you can use, it won't normally change its  
UNICODE meaning.
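A quick check of that claim (a hypothetical sketch, not from the original mail): the six literal characters of the escape encode to the same six bytes under the common ASCII-compatible charsets:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EscapeBytes {
    public static void main(String[] args) {
        // The literal six characters: backslash, 'u', '0', '0', 'b', '4'.
        String escape = "\\u00b4";
        byte[] ascii = escape.getBytes(StandardCharsets.US_ASCII);
        for (Charset cs : new Charset[] { StandardCharsets.UTF_8,
                StandardCharsets.ISO_8859_1, Charset.forName("Shift_JIS") }) {
            // Each charset produces 5c 75 30 30 62 34 — the escape
            // survives re-encoding unchanged.
            System.out.println(cs + ": " + Arrays.equals(ascii, escape.getBytes(cs)));
        }
    }
}
```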

That said, this is NOT a safe mechanism, because if (for example) you
were to read the byte sequence "5c 75 30 30 62 34" using the EBCDIC
encoding (IBM's mainframe encoding) you wouldn't read "backslash"
"letter u" "zero" "zero" "letter b" "four", but you would read
something quite different: "asterisk" "nil" "nil" "nil" "nil" "pn".
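If your JDK ships the IBM037 (EBCDIC cp037) charset — it is an extended charset, so availability varies by JDK build — you can watch the escape fall apart. A sketch, not part of the original mail:

```java
import java.nio.charset.Charset;

public class EbcdicDemo {
    public static void main(String[] args) {
        byte[] escape = { 0x5c, 0x75, 0x30, 0x30, 0x62, 0x34 };
        if (Charset.isSupported("IBM037")) {
            // In EBCDIC cp037 the byte 5c is '*', not backslash,
            // so the \u escape is no longer recognizable.
            String decoded = new String(escape, Charset.forName("IBM037"));
            System.out.println(decoded.charAt(0)); // *
        }
    }
}
```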

For an example of the EBCDIC encoding, look here: http://

>> Depending on your platform encoding (yours apparently ISO8859-1,  
>> mine  UTF-8, my wife's -she's japanese- Shift-JIS) that sequence  
>> (B4) of  BYTES as in the original source code will be interpreted  
>> as a  different character.
> The char encoding Shift-JIS (JIS X 0201:1997 or JIS X 0208:1997)
> is exactly the same as using ISO-8859-1. We need to keep the
> sources in UNICODE and there is also for Japanese: Hiragana,
> Katakana, et al:

Err... Ehmmm.. No... The character in question (Latin-1 character B4,
Acute Accent) is encoded in ISO8859-1 as the byte sequence "B4",
while in Shift-JIS the same character is encoded as the byte sequence
"81 4C", quite different.

Reading the byte sequence "B4" in Shift-JIS will produce Unicode
character FF74 (Halfwidth Katakana "E"), which is quite different
from the acute accent you intended.
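That difference is easy to verify (a hypothetical sketch, not part of the original mail — it assumes the Shift_JIS charset is available in your JDK):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteB4 {
    public static void main(String[] args) {
        byte[] b4 = { (byte) 0xB4 };
        // ISO8859-1: byte B4 decodes to U+00B4, the Acute Accent.
        System.out.println((int) new String(b4, StandardCharsets.ISO_8859_1).charAt(0)); // 180
        // Shift-JIS: the same byte decodes to U+FF74, Halfwidth Katakana "E".
        System.out.println((int) new String(b4, Charset.forName("Shift_JIS")).charAt(0)); // 65396
    }
}
```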

Trust me, I've been doing this for 9 years! :-)

>> Changing the binary sequence B4 to \u00B4 instructs the JVM that  
>> no  matter what encoding your platform is set to, the resulting  
>> character  will always (always) be UNICODE 00B4, the Acute Accent,  
>> part of the  Latin-1 (0X0080) table.
> If we wrote the code in UNICODE you will have the same effect. It  
> is exactly the same as with XML, isn't?

Unicode is simply a list of characters. To save them on a disk, you
_need_ to use an encoding. Unicode code points need up to 21 bits
(they fit in 16 bits until the supplementary planes came along, but
that ain't important right now), bytes are 8 bits long. It's as easy
as that. To represent that many bits in 8, you need to "compress"
them (or, as it's said in I18N, "encode" them).

Some encodings are complete (such as the family of UTF encodings)  
meaning that the encoding CAN represent ALL Unicode characters, some  
are not (such as ISO8859-1 which can represent only Unicode  
characters from 00 to FF).
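You can ask the encoder directly: CharsetEncoder.canEncode tells you whether a given charset covers a character. A sketch (not part of the original mail):

```java
import java.nio.charset.StandardCharsets;

public class Coverage {
    public static void main(String[] args) {
        char acuteAccent = '\u00B4'; // inside the Latin-1 range
        char katakanaE = '\uFF74';   // outside the Latin-1 range
        // ISO8859-1 only covers Unicode 00-FF; UTF-8 covers everything.
        System.out.println(StandardCharsets.ISO_8859_1.newEncoder().canEncode(acuteAccent)); // true
        System.out.println(StandardCharsets.ISO_8859_1.newEncoder().canEncode(katakanaE));   // false
        System.out.println(StandardCharsets.UTF_8.newEncoder().canEncode(katakanaE));        // true
    }
}
```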

Comparing Unicode to an encoding is like comparing an apple to the
speed of light: there's nothing in common between the two, but if you
say that an apple is 1 meter per second, you can say that the speed
of light is (roughly) 299,792,458 apples.

>> Let's call it defensive programming, and actually, in the source   
>> code, we should be using only characters in the range 00-7F  
>> (Unicode  BASIC-Latin, encoding US-ASCII), as that's the "most- 
>> common" amongst  all different encodings (even if when thinking  
>> about IBM's EBCDIC,  even that one might have some problems in  
>> some cases).
> I am sorry, but I do not like to cover the sun with a finger.


>  I believe Thorsten Schalab can tell us more about this topic. ;-)

Nah, I'm pretty confident that on this little nag, I'm right...

