harmony-dev mailing list archives

From "Stepan Mishura" <stepan.mish...@gmail.com>
Subject [bug-to-bug] UTF-8: interpreting non-shortest forms
Date Fri, 24 Mar 2006 06:56:46 GMT
According to the Unicode Standard 4.0 (and since 3.0), interpreting
non-shortest forms is forbidden for UTF-8. If a byte sequence is not in the
table of well-formed UTF-8 byte sequences, it is considered ill-formed and
must be treated as an error. Harmony follows the Unicode spec, but the RI
doesn't. I didn't find an explanation in the spec, but I assume the RI
behaves this way for backward compatibility.
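
To make the rule concrete, here is a minimal sketch of a check for
non-shortest lead bytes. It is based on my reading of the well-formed byte
sequence table in the Unicode Standard, not on Harmony or RI code; the class
and method names are hypothetical.

```java
public final class Utf8ShortestForm {
    private Utf8ShortestForm() {}

    /**
     * Returns true if the first two bytes of seq start a non-shortest
     * (overlong) UTF-8 encoding. Hypothetical helper for illustration only;
     * it does not validate the rest of the sequence.
     */
    public static boolean startsNonShortestForm(byte[] seq) {
        if (seq.length < 2) {
            return false;
        }
        int b0 = seq[0] & 0xFF;
        int b1 = seq[1] & 0xFF;
        if ((b1 & 0xC0) != 0x80) {
            return false; // second byte is not a continuation byte
        }
        if (b0 == 0xC0 || b0 == 0xC1) {
            return true;  // 2-byte form of a code point below U+0080
        }
        if (b0 == 0xE0 && b1 < 0xA0) {
            return true;  // 3-byte form of a code point below U+0800
        }
        if (b0 == 0xF0 && b1 < 0x90) {
            return true;  // 4-byte form of a code point below U+10000
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] shortest = {(byte) 0xD0, (byte) 0xAF};
        byte[] overlong = {(byte) 0xE0, (byte) 0x90, (byte) 0xAF};
        System.out.println(startsNonShortestForm(shortest)); // false
        System.out.println(startsNonShortestForm(overlong)); // true
    }
}
```

The check only needs the first two bytes because each overlong form is
identified by its lead byte plus the range of the first continuation byte.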

The following example demonstrates the difference. Code point 1071 (U+042F)
should be represented by the UTF-8 byte sequence <D0 AF>. But it can also be
written as the three-byte sequence <E0 90 AF>, which is its non-shortest
form. So the following code prints "ERROR" on the Harmony implementation and
"Ok with non-shortest forms" on the RI:

        String s1 = new String(
                new byte[]{(byte) 0xE0, (byte) 0x90, (byte) 0xAF}, "UTF-8");
        String s2 = new String(new char[]{1071});

        if(s1.equals(s2)){
            System.out.println("Ok with non-shortest forms");
        } else {
            System.out.println("ERROR");
        }
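
The String constructor replaces malformed input silently, so the snippet
above can only detect the difference indirectly. A sketch of a more explicit
check uses a CharsetDecoder configured to report errors. The class and method
names are hypothetical, and StandardCharsets assumes Java 7 or later; on an
implementation that follows the Unicode rule I would expect the three-byte
form to be rejected.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class StrictUtf8 {
    private StrictUtf8() {}

    /**
     * Decodes bytes as UTF-8 and returns null if the decoder reports
     * malformed input. Hypothetical helper for illustration only.
     */
    public static String decodeOrNull(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null; // the decoder treated the input as ill-formed
        }
    }

    public static void main(String[] args) {
        byte[] shortest = {(byte) 0xD0, (byte) 0xAF};
        byte[] overlong = {(byte) 0xE0, (byte) 0x90, (byte) 0xAF};
        System.out.println("shortest form accepted: " + (decodeOrNull(shortest) != null));
        System.out.println("non-shortest form accepted: " + (decodeOrNull(overlong) != null));
    }
}
```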

We should decide whether we are going to be compatible with the RI or with
the Unicode spec.

Thanks,
Stepan Mishura
Intel Middleware Products Division
