maven-doxia-dev mailing list archives

From Hervé BOUTEMY <herve.bout...@free.fr>
Subject Re: svn commit: r712146 - /maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java
Date Fri, 07 Nov 2008 16:39:54 GMT
-1: please revert this change

String.valueOf( '\u00a9' ) represents the Copyright symbol as a Unicode 
character: no encoding is involved at this point.
=> the code was perfectly correct and did what was expected

I'll try to explain what the new code does:
1. String.valueOf( '\u00a9' ): gets the Copyright symbol (as previously)
2. .getBytes(): gets the binary representation of this symbol IN THE PLATFORM 
ENCODING (see the API [1]); the result varies depending on your platform
3. new String( ..., "UTF-8" ): interprets that binary data as UTF-8

The result therefore varies depending on your platform encoding:
- if it is UTF-8 (as on my Linux box), you end up with the initial Copyright 
symbol: the code just turned the character into bytes, then back into a 
character
- if it is another encoding, you get something that is not the Copyright 
symbol, and might not even be anything valid
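
To make the platform dependence concrete, here is a minimal sketch that 
simulates two platform encodings by passing the charset to getBytes 
explicitly instead of relying on the default. The class and method names 
are made up for this illustration, and ISO-8859-1 merely stands in for 
"some non-UTF-8 platform":

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo
{
    /**
     * Mimics new String( s.getBytes(), "UTF-8" ) on a machine whose
     * platform encoding is the given charset (hypothetical helper).
     */
    static String roundTrip( String s, Charset platform )
    {
        byte[] bytes = s.getBytes( platform );
        return new String( bytes, StandardCharsets.UTF_8 );
    }

    public static void main( String[] args )
    {
        String copyright = String.valueOf( '\u00a9' );
        Charset[] platforms = { StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1 };
        for ( Charset platform : platforms )
        {
            String result = roundTrip( copyright, platform );
            // Only the UTF-8 "platform" gives back the original symbol.
            System.out.println( platform.name() + ": unchanged = " + copyright.equals( result ) );
        }
    }
}
```

Only the UTF-8 case should report the string as unchanged.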

Let's do the transformation for the case where you're on Windows, in western 
Europe.
The platform encoding is then CP-1252.
The Copyright symbol in CP-1252 is 0xA9=10101001 (see [2]).
In UTF-8, a byte starting with binary 10 is a continuation byte of a 
multi-byte sequence (see [3], or [4] for a simpler explanation in the French 
page, for people who read French ;) ): so we're facing an invalid byte 
sequence, since there was no preceding lead byte.
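
The bit-level argument can be checked with a small sketch like the one 
below. The class and helper names are hypothetical; the mask 0xC0 simply 
isolates the two highest bits of the byte:

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteCheck
{
    /** True if the byte has the bit pattern 10xxxxxx, i.e. a UTF-8 continuation byte. */
    static boolean isContinuationByte( byte b )
    {
        return ( b & 0xC0 ) == 0x80;
    }

    public static void main( String[] args )
    {
        // The single byte CP-1252 uses for the Copyright symbol.
        byte cp1252Copyright = (byte) 0xA9;
        System.out.println( Integer.toBinaryString( cp1252Copyright & 0xFF ) );
        System.out.println( "continuation byte: " + isContinuationByte( cp1252Copyright ) );

        // The real UTF-8 encoding puts a lead byte in front of it.
        byte[] utf8 = String.valueOf( '\u00a9' ).getBytes( StandardCharsets.UTF_8 );
        for ( byte b : utf8 )
        {
            System.out.println( String.format( "%02X", b & 0xFF )
                + " continuation byte: " + isContinuationByte( b ) );
        }
    }
}
```

In UTF-8 the symbol is the two bytes C2 A9: the same 0xA9, but preceded by 
the lead byte that is missing from the CP-1252 data.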
Write a little program with:
String copyright = String.valueOf( '\u00a9' );
byte[] b = copyright.getBytes( "CP1252" );
String result = new String( b, "UTF-8" );
byte[] b2 = result.getBytes( "UTF-8" );

and run it step by step in a debugger, looking into the variables, and you'll 
see that b contains 1 byte but b2 contains 3 bytes. And copyright is 
perfectly displayed while result is not.
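
For those who prefer printing to stepping through a debugger, the four lines 
above could be wrapped into a runnable class along these lines. This is only 
a sketch: the class name and the hex helper are invented for the example, 
and "windows-1252" is used as the canonical Java name of the CP1252 charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CopyrightDebug
{
    /** Formats bytes as space-separated uppercase hex (hypothetical helper). */
    static String hex( byte[] bytes )
    {
        StringBuilder sb = new StringBuilder();
        for ( byte b : bytes )
        {
            if ( sb.length() > 0 )
            {
                sb.append( ' ' );
            }
            sb.append( String.format( "%02X", b & 0xFF ) );
        }
        return sb.toString();
    }

    public static void main( String[] args )
    {
        Charset cp1252 = Charset.forName( "windows-1252" );

        String copyright = String.valueOf( '\u00a9' );
        byte[] b = copyright.getBytes( cp1252 );
        // The lone 0xA9 is malformed UTF-8, so the decoder substitutes U+FFFD.
        String result = new String( b, StandardCharsets.UTF_8 );
        byte[] b2 = result.getBytes( StandardCharsets.UTF_8 );

        System.out.println( "b  = " + hex( b ) );   // A9
        System.out.println( "b2 = " + hex( b2 ) );  // EF BF BD
        System.out.println( "result is U+FFFD: " + result.equals( "\uFFFD" ) );
    }
}
```

The 3 bytes in b2 are the UTF-8 encoding of U+FFFD, the replacement 
character Java substitutes for the undecodable byte.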

I hope this explanation helps.

Regards,

Hervé


[1] http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#getBytes()

[2] http://fr.wikipedia.org/wiki/Windows-1252

[3] http://en.wikipedia.org/wiki/UTF-8

[4] http://fr.wikipedia.org/wiki/UTF-8

On Friday 07 November 2008, vsiveton@apache.org wrote:
> Author: vsiveton
> Date: Fri Nov  7 07:01:52 2008
> New Revision: 712146
>
> URL: http://svn.apache.org/viewvc?rev=712146&view=rev
> Log:
> o be sure that UTF-8 will be used
>
> Modified:
>    maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java
>
> Modified: maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java
> URL: http://svn.apache.org/viewvc/maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java?rev=712146&r1=712145&r2=712146&view=diff
> ==============================================================================
> --- maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java (original)
> +++ maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java Fri Nov  7 07:01:52 2008
> @@ -19,6 +19,7 @@
>   * under the License.
>   */
>
> +import java.io.UnsupportedEncodingException;
>
>  /**
>   * Static methods to generate standard Doxia sink events.
> @@ -595,7 +596,15 @@
>          sink.paragraph_();
>
>          sink.paragraph();
> -        String copyright = String.valueOf( '\u00a9' );
> +        String copyright;
> +        try
> +        {
> +            copyright = new String( String.valueOf( '\u00a9' ).getBytes(), "UTF-8" );
> +        }
> +        catch ( UnsupportedEncodingException e )
> +        {
> +            copyright = "";
> +        }
>          sink.text( "Copyright symbol: " + copyright + ", "
>              + copyright + ", " + copyright + "." );
>          sink.paragraph_();