cocoon-dev mailing list archives

From Antonio Gallardo <agalla...@agssa.net>
Subject Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
Date Mon, 05 Sep 2005 00:53:40 GMT
Pier Fumagalli wrote:

> On 5 Sep 2005, at 00:33, Antonio Gallardo wrote:
>
>> pier@apache.org wrote:
>>
>>> Author: pier
>>> Date: Sun Sep  4 16:29:09 2005
>>> New Revision: 278641
>>>
>>> URL: http://svn.apache.org/viewcvs?rev=278641&view=rev
>>> Log:
>>> Fixing wrong encoding bug
>>>
>>> Modified:
>>>    cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/ 
>>> cocoon/components/language/markup/xsp/XSPExpressionParser.java
>>>
>>> @@ -211,7 +211,7 @@
>>>                     parser.setState(EXPRESSION_CHAR_STATE);
>>>                     break;
>>> -                case '�':
>>> +                case '\u00B4':
>>>                     parser.append(ch);
>>>                     parser.setState(EXPRESSION_SHELL_STATE);
>>>                     break;
>>> @@ -235,10 +235,10 @@
>>>     protected static final State EXPRESSION_CHAR_STATE = new  
>>> QuotedState('\'');
>>>     /**
>>> -     * The parser has encountered '�' in <code>{@link  
>>> EXPRESSION_STATE}</code>
>>> -     * to start a Python string constant.
>>> +     * The parser has encountered '\u00B4' (Unicode Latin-1 Acute  
>>> Accent) in
>>> +     * <code>{@link EXPRESSION_STATE}</code> to start a Python 
>>> string constant.
>>>      */
>>> -    protected static final State EXPRESSION_SHELL_STATE = new  
>>> QuotedState('�');
>>> +    protected static final State EXPRESSION_SHELL_STATE = new  
>>> QuotedState('\u00B4');
>>>
>>>
>> Why not only left the original char as it was before your first  
>> change? It was working. Having a UTF-8 IMO is not good.
>
>
> It's not a UTF-8 character, it's an UNICODE character: \u doesn't  
> mean "UTF" but rather "UNICODE" (which is not an encoding).

First, I apologize: I worded my previous sentence very badly. I meant 
that I don't see a reason to use a Java "Unicode escape" in this case. 
The Java Language Specification says [0]: "Programs are written using 
the Unicode character set." So IMO Unicode 00B4, the Acute Accent in 
Latin-1, should be represented by only one code.

> Depending on your platform encoding (yours apparently ISO8859-1, mine  
> UTF-8, my wife's -she's japanese- Shift-JIS) that sequence (B4) of  
> BYTES as in the original source code will be interpreted as a  
> different character.

The Shift-JIS character encoding (JIS X 0201:1997 or JIS X 0208:1997) is 
exactly the same as using ISO-8859-1. We need to keep the sources in 
Unicode, which also covers Japanese (Hiragana, Katakana, et al.): 
http://www.unicode.org/charts/
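
One way to check how the raw byte B4 is read under different encodings 
is a small sketch like the one below. The class name ByteB4Demo and the 
helper decode() are made up for illustration and are not part of 
Cocoon; Shift_JIS is only tried if the running JVM supports it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Cocoon.
class ByteB4Demo {
    // Decode the single byte 0xB4 with the given charset and return the
    // code point of the first resulting character.
    static int decode(Charset cs) {
        String s = new String(new byte[] {(byte) 0xB4}, cs);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // ISO-8859-1 maps every byte to the code point of the same value.
        System.out.printf("ISO-8859-1: U+%04X%n", decode(StandardCharsets.ISO_8859_1));
        // A lone 0xB4 is not valid UTF-8, so the decoder substitutes U+FFFD.
        System.out.printf("UTF-8:      U+%04X%n", decode(StandardCharsets.UTF_8));
        // Shift_JIS is not guaranteed to be present in every JVM.
        if (Charset.isSupported("Shift_JIS")) {
            System.out.printf("Shift_JIS:  U+%04X%n", decode(Charset.forName("Shift_JIS")));
        }
    }
}
```

Running it on each platform in question shows which character that 
platform's encoding assigns to the byte.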

>
> Changing the binary sequence B4 to \u00B4 instructs the JVM that no  
> matter what encoding your platform is set to, the resulting character  
> will always (always) be UNICODE 00B4, the Acute Accent, part of the  
> Latin-1 (0X0080) table.

If we wrote the code in Unicode, we would have the same effect. It is 
exactly the same as with XML, isn't it?
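
To make the comparison concrete, here is a minimal sketch (the class 
EscapeDemo and its hex() helper are hypothetical, not Cocoon code). It 
assumes nothing about the platform encoding: the source itself is plain 
ASCII because the character is written as an escape, and it prints the 
bytes that the same character U+00B4 becomes when stored as ISO-8859-1 
and as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Cocoon.
class EscapeDemo {
    // Written with a Unicode escape, so this source file is pure ASCII.
    static final char ACUTE = '\u00B4';

    // Render bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The escape always denotes code point 0x00B4.
        System.out.println((int) ACUTE == 0x00B4);                         // true
        String s = String.valueOf(ACUTE);
        // The same character is stored as different bytes per encoding.
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1)));  // B4
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));       // C2 B4
    }
}
```

A source file containing the literal character would hold one of those 
byte sequences, so the compiler has to be told which encoding the file 
uses (for example with javac's -encoding option) to read it back as 
U+00B4.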

> Let's call it defensive programming, and actually, in the source  
> code, we should be using only characters in the range 00-7F (Unicode  
> BASIC-Latin, encoding US-ASCII), as that's the "most-common" amongst  
> all different encodings (even if when thinking about IBM's EBCDIC,  
> even that one might have some problems in some cases).

I am sorry, but I do not like to cover the sun with a finger.

I believe Thorsten Schalab can tell us more about this topic. ;-)

Best Regards,

Antonio Gallardo.

[0] 
http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#95413 

