cocoon-dev mailing list archives

From Pier Fumagalli <p...@betaversion.org>
Subject Re: svn commit: r278641 - /cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/cocoon/components/language/markup/xsp/XSPExpressionParser.java
Date Sun, 04 Sep 2005 23:45:44 GMT
On 5 Sep 2005, at 00:33, Antonio Gallardo wrote:
> pier@apache.org wrote:
>
>> Author: pier
>> Date: Sun Sep  4 16:29:09 2005
>> New Revision: 278641
>>
>> URL: http://svn.apache.org/viewcvs?rev=278641&view=rev
>> Log:
>> Fixing wrong encoding bug
>>
>> Modified:
>>    cocoon/branches/BRANCH_2_1_X/src/blocks/xsp/java/org/apache/ 
>> cocoon/components/language/markup/xsp/XSPExpressionParser.java
>>
>> @@ -211,7 +211,7 @@
>>                     parser.setState(EXPRESSION_CHAR_STATE);
>>                     break;
>> -                case '´':
>> +                case '\u00B4':
>>                     parser.append(ch);
>>                     parser.setState(EXPRESSION_SHELL_STATE);
>>                     break;
>> @@ -235,10 +235,10 @@
>>     protected static final State EXPRESSION_CHAR_STATE = new  
>> QuotedState('\'');
>>     /**
>> -     * The parser has encountered '´' in <code>{@link  
>> EXPRESSION_STATE}</code>
>> -     * to start a Python string constant.
>> +     * The parser has encountered '\u00B4' (Unicode Latin-1 Acute  
>> Accent) in
>> +     * <code>{@link EXPRESSION_STATE}</code> to start a Python  
>> string constant.
>>      */
>> -    protected static final State EXPRESSION_SHELL_STATE = new  
>> QuotedState('´');
>> +    protected static final State EXPRESSION_SHELL_STATE = new  
>> QuotedState('\u00B4');
>>
>>
> Why not just leave the original char as it was before your first
> change? It was working. Having a UTF-8 character is IMO not good.

It's not a UTF-8 character, it's a Unicode character: \u doesn't
mean "UTF" but rather "Unicode" (which is a character set, not an
encoding).

Depending on your platform encoding (yours is apparently ISO-8859-1,
mine is UTF-8, and my wife's, since she's Japanese, is Shift-JIS), the
raw byte B4 in the original source code will be interpreted as a
different character.
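
To make this concrete, here is a minimal sketch (the class name
ByteB4Demo is only illustrative, not part of Cocoon) that decodes the
single byte B4 under a few charsets. ISO-8859-1 yields U+00B4. In UTF-8
a lone B4 is a continuation byte with no lead byte, so the decoder
substitutes U+FFFD. Shift_JIS is tried only if the running JVM supports
it, since it is not one of the charsets every Java platform must
provide.

```java
import java.nio.charset.Charset;

class ByteB4Demo {
    /** Decodes the single byte 0xB4 using the named charset. */
    static String decode(String charsetName) {
        byte[] raw = { (byte) 0xB4 };
        // Malformed input is replaced with the charset's default
        // replacement string rather than throwing an exception.
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String[] names = { "ISO-8859-1", "UTF-8", "Shift_JIS" };
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + " is not available on this JVM");
                continue;
            }
            String s = decode(name);
            System.out.printf("%-10s -> U+%04X%n", name, (int) s.charAt(0));
        }
    }
}
```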

Changing the raw byte B4 to \u00B4 tells the Java compiler that, no
matter what encoding your platform is set to, the resulting character
will always (always) be Unicode U+00B4, the Acute Accent, part of the
Latin-1 Supplement block (U+0080 to U+00FF).
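
As a small illustration (AcuteAccentEscape and isAcuteDelimiter are
made-up names for this sketch, not the actual XSPExpressionParser
code), the escaped form below consists of ASCII bytes only, so the
source file reads the same under ISO-8859-1, UTF-8 or Shift-JIS, and
the constant always has the numeric value 0xB4.

```java
class AcuteAccentEscape {
    // Written as a Unicode escape: the source bytes are plain ASCII,
    // yet the compiled constant is always U+00B4 (ACUTE ACCENT).
    static final char ACUTE = '\u00B4';

    /** Hypothetical helper: true if ch is the acute accent delimiter. */
    static boolean isAcuteDelimiter(char ch) {
        return ch == ACUTE;
    }

    public static void main(String[] args) {
        System.out.println((int) ACUTE == 0xB4);           // true
        System.out.println(isAcuteDelimiter((char) 0xB4)); // true
        System.out.println(isAcuteDelimiter('\''));        // false
    }
}
```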

Let's call it defensive programming. Actually, in the source code we
should be using only characters in the range 00-7F (Unicode Basic
Latin, encoded as US-ASCII), as that's the subset shared by most of
the different encodings (although with IBM's EBCDIC even that range
might cause problems in some cases).
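
One way to enforce that rule is a small check over the raw bytes of a
source file. This is only a sketch: AsciiOnlyCheck and firstNonAscii
are hypothetical names, and nothing like it is claimed to exist in the
Cocoon build.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class AsciiOnlyCheck {
    /**
     * Returns the offset of the first byte outside 00-7F,
     * or -1 if every byte is plain US-ASCII.
     */
    static int firstNonAscii(byte[] source) {
        for (int i = 0; i < source.length; i++) {
            // Mask to treat the signed Java byte as an unsigned value.
            if ((source[i] & 0xFF) > 0x7F) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java AsciiOnlyCheck <source-file>");
            return;
        }
        int offset = firstNonAscii(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(offset < 0
                ? "ASCII only"
                : "non-ASCII byte at offset " + offset);
    }
}
```

Run against a file that still contains the raw B4 byte, it would report
the offset of that byte; after the switch to the escaped form it would
report "ASCII only".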

     Pier

