cocoon-dev mailing list archives

From Antonio Gallardo
Subject Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/
Date Mon, 05 Sep 2005 06:43:16 GMT
Pier Fumagalli wrote:

> On 5 Sep 2005, at 01:53, Antonio Gallardo wrote:
>> Pier Fumagalli wrote:
>>> Depending on your platform encoding (yours apparently ISO8859-1,
>>> mine UTF-8, my wife's -she's Japanese- Shift-JIS) that sequence
>>> (B4) of BYTES as in the original source code will be interpreted
>>> as a different character.
>> The char encoding Shift-JIS (JIS X 0201:1997 or JIS X 0208:1997) is
>> exactly the same as using ISO-8859-1. We need to keep the sources
>> in UNICODE and there is also for Japanese: Hiragana, Katakana, et
>> al:
> Err... Ehmmm.. No... The character in question (Latin-1 character B4,
> Acute Accent) is encoded in ISO8859-1 as the byte sequence "B4",
> while in Shift-JIS the same character is encoded as byte sequence "81
> 4C", quite different.
> Reading the byte sequence "B4" in Shift-JIS will produce Unicode
> character FF74 (Halfwidth katakana "E"), which is quite different
> from an acute accent as you intended.
> Trust me, I've been doing this for 9 years! :-)

Yes, I believe you. :-) When I said that using Shift-JIS and ISO-8859-1
is the same, I had in mind that neither of them represents the full
Unicode spectrum. I was just trying to show the same problem in another
charset, so in fact we are facing the same problem. Of course I am aware
that the two code sets (Shift-JIS and ISO-8859-1) cover different
subsets of Unicode, which is what you stated.
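
For anyone who wants to check the byte values discussed above, here is
a minimal sketch. The class and method names (DecodeB4, firstCodePoint)
are made up for illustration, and it assumes the JVM ships the optional
Shift_JIS charset, which a full JDK normally does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Cocoon.
class DecodeB4 {
    // Decode the bytes with the given charset and return the first code point.
    static int firstCodePoint(byte[] bytes, Charset charset) {
        return new String(bytes, charset).codePointAt(0);
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xB4 };
        // Assumption: Shift_JIS is installed; it is not one of the charsets
        // that every Java platform is required to support.
        Charset sjis = Charset.forName("Shift_JIS");

        System.out.printf("ISO-8859-1: U+%04X%n",
                firstCodePoint(raw, StandardCharsets.ISO_8859_1));
        System.out.printf("Shift_JIS:  U+%04X%n", firstCodePoint(raw, sjis));
        // A lone B4 is not valid UTF-8, so the decoder substitutes U+FFFD.
        System.out.printf("UTF-8:      U+%04X%n",
                firstCodePoint(raw, StandardCharsets.UTF_8));
    }
}
```

The same single byte decodes to three different characters, which is
the problem with a raw B4 in a source file.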

>>> Changing the binary sequence B4 to \u00B4 instructs the JVM that
>>> no matter what encoding your platform is set to, the resulting
>>> character will always (always) be UNICODE 00B4, the Acute Accent,
>>> part of the Latin-1 (0x0080) table.
>> If we wrote the code in UNICODE you will have the same effect. It is
>> exactly the same as with XML, isn't it?
> Unicode is simply a list of characters. To save them on a disk, you
> _need_ to use an encoding. Unicode characters are 32 bits long (they
> were 16 bits until Unicode 4 came along, but that ain't important
> right now), bytes are 8 bits long. It's as easy as that. To represent
> 32 bits in 8, you need to "compress" them (or, as said in I18N,
> "encode" them).
> Some encodings are complete (such as the family of UTF encodings),
> meaning that the encoding CAN represent ALL Unicode characters; some
> are not (such as ISO8859-1, which can represent only Unicode
> characters from 00 to FF).
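
A small sketch of the two points quoted above: the escape in the source
and the difference between complete and incomplete encodings. The names
(EncodeAcute, hex, canEncode) are hypothetical, and it again assumes
the optional Shift_JIS charset is available in the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Cocoon.
class EncodeAcute {
    // Written as a Unicode escape, so the encoding used to read this
    // source file cannot change which character ends up in the string.
    static final String ACUTE = "\u00B4";

    // Encode the text with the given charset and show the bytes as hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // True if the charset has a byte representation for every character.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS"); // assumed to be installed
        System.out.println("ISO-8859-1: " + hex(ACUTE, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + hex(ACUTE, StandardCharsets.UTF_8));
        System.out.println("Shift_JIS:  " + hex(ACUTE, sjis));

        // Hiragana "a" lies outside the 00..FF range that ISO-8859-1 covers.
        String hiragana = "\u3042";
        System.out.println("Latin-1 can encode hiragana: "
                + canEncode(hiragana, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 can encode hiragana:   "
                + canEncode(hiragana, StandardCharsets.UTF_8));
    }
}
```

The character is the same in every case; only the bytes written to
disk differ from one encoding to the next.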

Yes. Please correct me here if I am wrong: does our SVN use UTF-8 as
the default charset (or encoding), or not? If not, then we need to take
care not only of the Java sources but also of the characters above 7F
in the XML files.

I have a special interest in this, since we write mostly Spanish
messages. I would like to know whether this is needed or not.
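
To get an idea of which files would be affected, something like the
following could flag content with bytes above 7F. This is only a
sketch; NonAsciiScan and firstNonAscii are made-up names, and a real
check would read the bytes from the files in the working copy.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Cocoon.
class NonAsciiScan {
    // Return the index of the first byte above 0x7F, or -1 if all bytes are ASCII.
    static int firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0xFF) > 0x7F) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A Spanish word with an n-tilde, written with an escape on purpose.
        String message = "Se\u00F1al";
        byte[] latin1 = message.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = message.getBytes(StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1: " + latin1.length
                + " bytes, first non-ASCII at " + firstNonAscii(latin1));
        System.out.println("UTF-8:      " + utf8.length
                + " bytes, first non-ASCII at " + firstNonAscii(utf8));
    }
}
```

Any file where this returns something other than -1 depends on the
encoding it is read with.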

Best Regards,

Antonio Gallardo.
