harmony-dev mailing list archives

From Richard Liang <richard.lian...@gmail.com>
Subject Re: [bug-to-bug] UTF-8: interpreting non-shortest forms
Date Mon, 27 Mar 2006 01:20:39 GMT
Nathan Beyer wrote:
> I've seen similar differences between other VMs in the handling of UTF-8
> encoded data, especially between the Sun and IBM VMs.  For example, if you
> read a UTF-8 encoded file that contains invalid bytes, the IBM VM will throw
> an IOException, but the Sun VM will convert the invalid bytes into the
> Unicode replacement character U+FFFD (the diamond-backed question mark).
>
> Personally, I prefer VMs that explicitly stick to Unicode and the various
> encodings and indicate error conditions.
>
>   
Hello Nathan,

+1, we should stick to the Unicode standard and its encodings.
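
For reference, the two behaviours described above (throwing on invalid bytes versus substituting U+FFFD) can be sketched with the standard java.nio.charset API. This is only a minimal illustration, not what either VM does internally; the class and method names (Utf8ErrorPolicy, decodeLenient, decodeStrict, isWellFormed) are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, written only to illustrate the two policies.
public class Utf8ErrorPolicy {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Lenient: Charset.decode always replaces malformed input with U+FFFD.
    static String decodeLenient(byte[] bytes) {
        return UTF8.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Strict: a decoder configured with REPORT throws on malformed input.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return UTF8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Convenience wrapper: true if the bytes decode without any error.
    static boolean isWellFormed(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = {'a', (byte) 0xFF, 'b'};
        System.out.println(decodeLenient(bad).equals("a\uFFFDb")); // true
        System.out.println(isWellFormed(bad));                     // false
    }
}
```

CharacterCodingException is a subclass of IOException, so a reader built on a strict decoder surfaces bad input as an IOException to its caller.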
> -Nathan
>
>   
>> -----Original Message-----
>> From: Stepan Mishura [mailto:stepan.mishura@gmail.com]
>> Sent: Friday, March 24, 2006 12:57 AM
>> To: harmony-dev
>> Subject: [bug-to-bug] UTF-8: interpreting non-shortest forms
>>
>> According to the Unicode Standard 4.0 (since 3.0), interpretation of
>> non-shortest forms is forbidden for UTF-8. So if a byte sequence is not in
>> the table of well-formed UTF-8 byte sequences, it is considered ill-formed
>> and treated as an error. Harmony follows the Unicode spec, but the RI
>> doesn't. I didn't find an explanation in the spec, but I assume it is for
>> backward compatibility.
>>
>> The following example demonstrates the difference. Code point 1071
>> (U+042F) should be represented by the UTF-8 byte sequence <D0 AF>, but it
>> can also be written as the 3-byte sequence <E0 90 AF>, which is its
>> non-shortest form. So the following code prints "ERROR" on the Harmony
>> implementation and "Ok with non-shortest forms" on the RI:
>>
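>>         // <E0 90 AF>: non-shortest (3-byte) form of code point 1071
>>         String s1 = new String(
>>                 new byte[]{(byte) 0xE0, (byte) 0x90, (byte) 0xAF}, "UTF-8");
>>         // The same code point built directly from its char value
>>         String s2 = new String(new char[]{1071});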
>>
>>         if(s1.equals(s2)){
>>             System.out.println("Ok with non-shortest forms");
>>         } else {
>>             System.out.println("ERROR");
>>         }
>>
>> We should decide whether we are going to be compatible with the RI or
>> with the Unicode spec.
>>
>> Thanks,
>> Stepan Mishura
>> Intel Middleware Products Division
>>     
>
>
>   
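
To make the arithmetic in Stepan's example concrete, here is a small sketch that extracts the payload bits of <E0 90 AF> by hand and compares the result against the shortest UTF-8 length for that code point. It is not taken from Harmony or the RI; the class and method names (OverlongCheck, decodeThreeByteUnchecked, shortestLength) are hypothetical.

```java
// Hypothetical illustration of why <E0 90 AF> is a non-shortest form.
public class OverlongCheck {
    // Combine the payload bits of a 3-byte sequence 1110xxxx 10xxxxxx 10xxxxxx
    // without checking whether the result needed three bytes at all.
    static int decodeThreeByteUnchecked(int b0, int b1, int b2) {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    // Minimum number of bytes UTF-8 needs to encode a code point.
    static int shortestLength(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    public static void main(String[] args) {
        int cp = decodeThreeByteUnchecked(0xE0, 0x90, 0xAF);
        System.out.println(cp);                 // 1071
        System.out.println(shortestLength(cp)); // 2
    }
}
```

A decoder that only performs the first step accepts the sequence and yields 1071. A decoder that also checks the shortest length sees that 2 bytes would have sufficed and treats the 3-byte form as ill-formed.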

