harmony-dev mailing list archives

From Richard Liang <richard.lian...@gmail.com>
Subject Re: [bug-to-bug] UTF-8: interpreting non-shortest forms
Date Mon, 27 Mar 2006 11:58:29 GMT
Stepan Mishura wrote:
> On 3/27/06, Richard Liang wrote:
>   
>> Nathan Beyer wrote:
>>> I've seen similar differences between other VMs around the handling of
>>> UTF-8 encoded data, especially between Sun and IBM VMs.  For example, if
>>> you read a file with a UTF-8 encoding that contains an invalid byte(s),
>>> the IBM VM will throw an IOException, but the Sun VM will convert the
>>> invalid byte(s) into the Unicode unknown character
>>> (diamond-backed-question-mark).
>>>
>>> Personally, I prefer VMs that explicitly stick to Unicode and the various
>>> encodings and indicate error conditions.
>>>
>> Hello Nathan,
>>
>> +1, we shall stick to Unicode and various encodings.
>>
>
>
>
> For me it is not obvious, and I cannot make the choice.
> Consider a theoretical situation: suppose the next Unicode spec update
> or corrigendum requires a change that breaks Harmony's backward
> compatibility. Should we stick to the new Unicode version or stay
> backward compatible?
>
>   
Hello Stepan,

For this situation, we may have three options:
1. Comply with the new version of the Unicode spec
2. Comply with the original version of the Unicode spec
3. Comply with the new version of the Unicode spec while keeping some
violations

I think 1 and 2 may be proper answers, but 3 is not.

Let's think about why we support Unicode. IMHO, it's because Unicode is a
bridge that ensures interoperability between applications that use
different encoding systems. If we announce that we support one version of
Unicode while keeping some violations, how can we ensure interoperability
with other applications?
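
To make the compliance question concrete, here is a minimal sketch of how a
strict and a lenient UTF-8 decoder treat the bytes from Stepan's example. It
is an illustration only, under a few assumptions: the class and method names
(Utf8ShortestFormDemo, decodeStrict, decodeLenient, naiveThreeByteCodePoint)
are made up for this message, and it uses the java.nio.charset API of a
current JDK (StandardCharsets needs Java 7 or later), not code from Harmony
or the RI.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name is an assumption made for this sketch.
class Utf8ShortestFormDemo {

    // Strict decoding: ill-formed input is reported as an exception.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Lenient decoding: ill-formed input is replaced with U+FFFD.
    static String decodeLenient(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // What a decoder computes for a 3-byte sequence if it skips the
    // shortest-form check: the payload bits are simply concatenated.
    static int naiveThreeByteCodePoint(int b1, int b2, int b3) {
        return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
    }

    public static void main(String[] args) throws Exception {
        byte[] shortest = {(byte) 0xD0, (byte) 0xAF};
        byte[] nonShortest = {(byte) 0xE0, (byte) 0x90, (byte) 0xAF};

        // The 2-byte form is well-formed and decodes to code point 1071.
        System.out.println("shortest form -> U+"
                + Integer.toHexString(decodeStrict(shortest).codePointAt(0)));

        // Without the shortest-form check, the 3-byte form yields the same value.
        System.out.println("naive 3-byte  -> U+"
                + Integer.toHexString(naiveThreeByteCodePoint(0xE0, 0x90, 0xAF)));

        try {
            decodeStrict(nonShortest);
            System.out.println("strict decoder accepted the non-shortest form");
        } catch (CharacterCodingException e) {
            System.out.println("strict decoder rejected it: " + e);
        }

        // The lenient decoder substitutes replacement characters instead.
        String lenient = decodeLenient(nonShortest);
        System.out.println("lenient result contains U+FFFD: "
                + (lenient.indexOf('\uFFFD') >= 0));
    }
}
```

On a decoder that enforces shortest forms, I would expect the strict call to
fail with a CharacterCodingException and the lenient call to return U+FFFD
replacement characters rather than U+042F. A decoder that skips the check
would instead return the value computed by naiveThreeByteCodePoint, which is
the RI behaviour Stepan describes.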
> Thanks,
> Stepan.
>
>   
>> -Nathan
>>     
>>>       
>>>> -----Original Message-----
>>>> From: Stepan Mishura [mailto:stepan.mishura@gmail.com]
>>>> Sent: Friday, March 24, 2006 12:57 AM
>>>> To: harmony-dev
>>>> Subject: [bug-to-bug] UTF-8: interpreting non-shortest forms
>>>>
>>>> According to the Unicode Standard 4.0 (since 3.0), interpretation of
>>>> non-shortest forms is forbidden for UTF-8. So if a byte sequence is not
>>>> in the table of well-formed UTF-8 byte sequences, it is considered
>>>> ill-formed and treated as an error. Harmony follows the Unicode spec,
>>>> but the RI doesn't. I didn't find an explanation in the spec, but I
>>>> assume it is caused by backward compatibility.
>>>>
>>>> The following example demonstrates the difference. Code point 1071
>>>> (U+042F) should be represented by the UTF-8 byte sequence <D0 AF>, but
>>>> it may also be written as the 3-byte sequence <E0 90 AF>, which is its
>>>> non-shortest form. So the following code prints "ERROR" on the Harmony
>>>> implementation and "Ok with non-shortest forms" on the RI:
>>>>
>>>>         // <E0 90 AF> is the non-shortest (3-byte) encoding of U+042F
>>>>         String s1 = new String(new byte[]{(byte) 0xE0, (byte) 0x90,
>>>>                 (byte) 0xAF}, "UTF-8");
>>>>         // 1071 == 0x042F, the same code point built directly
>>>>         String s2 = new String(new char[]{1071});
>>>>
>>>>         if (s1.equals(s2)) {
>>>>             System.out.println("Ok with non-shortest forms");
>>>>         } else {
>>>>             System.out.println("ERROR");
>>>>         }
>>>>
>>>> We should decide whether we are going to be compatible with the RI or
>>>> with the Unicode spec.
>>>>
>>>> Thanks,
>>>> Stepan Mishura
>>>> Intel Middleware Products Division
>>>>
>>>>         
>>>
>>>       
>>
>>     
>
>
> --
> Thanks,
> Stepan Mishura
> Intel Middleware Products Division
>
>   


-- 
Richard Liang
China Software Development Lab, IBM 

