harmony-dev mailing list archives

From "Andrew Zhang" <zhanghuang...@gmail.com>
Subject Re: [jira] Updated: (HARMONY-308) java.nio.charset.Charset.encode(CharBuffer) returns bytes in a different order in Harmony and RI for the UTF-16 charset
Date Fri, 07 Apr 2006 11:22:05 GMT
Hello, Dmitry,

I agree with you that Harmony's behavior is not consistent with the Java spec.

As you may know, Harmony's java.nio.charset.Charset wraps ICU to implement
the encode/decode operations.

The following description is quoted from the ICU User Guide
(http://icu.sourceforge.net/userguide/unicodeBasics.html):

"The names "UTF-16" and "UTF-32" are ambiguous. Depending on context, they
refer either to character encoding forms where 16/32-bit words are processed
and are naturally stored in the platform endianness, or they refer to the
IANA-registered charset names, i.e., to character encoding schemes or byte
serializations. In addition to simple byte serialization, the charsets with
these names also use optional Byte Order Marks (see Serialized Formats
<http://icu.sourceforge.net/userguide/unicodeBasics.html#serialized_formats>
below)."
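
To make the distinction between the serializations concrete, here is a
minimal sketch that encodes one character with the three charset names.
The class and helper names (Utf16Schemes, encode, hex) are made up for
illustration. What "UTF-16" prints is exactly the point under discussion,
so the output of that line depends on the implementation you run it on.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class Utf16Schemes {
    // Encode text with the named charset and return exactly the bytes produced.
    static byte[] encode(String charsetName, String text) {
        ByteBuffer buf = Charset.forName(charsetName).encode(text);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // Render bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u041b"; // first character of the HARMONY-308 test data
        for (String name : new String[] {"UTF-16BE", "UTF-16LE", "UTF-16"}) {
            System.out.println(name + ": " + hex(encode(name, text)));
        }
    }
}
```

"UTF-16BE" and "UTF-16LE" fix the byte order and write no byte-order mark;
only plain "UTF-16" leaves room for the disagreement in this thread.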

The result of running your test case on IBM JDK 1.4.2 is exactly the same
as on Harmony. I assume IBM JDK 1.4.2 has passed the TCK.

Therefore, IMO, both behaviours are acceptable.

What's your opinion?
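
For anyone who wants to compare implementations, here is a rough sketch
built on the character data from the HARMONY-308 test case. The class and
method names (Utf16EncodeCheck, expectedBigEndian, byteOrderOf,
encodeUtf16) are hypothetical and are not part of the attached test. The
sketch only reports which byte-order mark encode() wrote and whether the
result matches the big-endian bytes implied by the spec wording; it does
not settle which behaviour is acceptable.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.util.Arrays;

class Utf16EncodeCheck {
    // Character data from the HARMONY-308 test case (no byte-order mark).
    static final char[] CHARS = {
        0x041b, 0x0435, 0x0442, 0x043e, 0x0020, 0x0432, 0x0020,
        0x0420, 0x043e, 0x0441, 0x0441, 0x0438, 0x0438};

    // Bytes implied by the spec wording: FE FF, then each char high byte first.
    static byte[] expectedBigEndian(char[] chars) {
        byte[] out = new byte[2 + 2 * chars.length];
        out[0] = (byte) 0xFE;
        out[1] = (byte) 0xFF;
        for (int i = 0; i < chars.length; i++) {
            out[2 + 2 * i] = (byte) (chars[i] >> 8);
            out[3 + 2 * i] = (byte) chars[i];
        }
        return out;
    }

    // Classify an encoded stream by its leading byte-order mark.
    static String byteOrderOf(byte[] encoded) {
        if (encoded.length >= 2) {
            int b0 = encoded[0] & 0xFF;
            int b1 = encoded[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "big-endian";
            if (b0 == 0xFF && b1 == 0xFE) return "little-endian";
        }
        return "no BOM";
    }

    // Encode with the plain "UTF-16" charset of the running implementation.
    static byte[] encodeUtf16(char[] chars) {
        ByteBuffer buf = Charset.forName("UTF-16").encode(CharBuffer.wrap(chars));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] actual = encodeUtf16(CHARS);
        System.out.println("byte order: " + byteOrderOf(actual));
        System.out.println("matches spec wording: "
                + Arrays.equals(expectedBigEndian(CHARS), actual));
    }
}
```

An implementation that follows the quoted spec sentence literally should
report "big-endian" and a match; one that behaves as described for Harmony
in this thread would report "little-endian" and no match.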

On 4/7/06, Dmitry M. Kononov <dmitry.m.kononov@gmail.com> wrote:
>
> Hi Richard,
>
> On 4/6/06, Richard Liang <richard.liangyx@gmail.com> wrote:
>
> > Dmitry M. Kononov wrote:
> > > As you correctly noticed, the cause of this issue is that Harmony uses
> > > the little-endian byte order if an encoded UTF-16 sequence has no
> > > byte-order mark. However, the spec addresses this case explicitly:
> > >
> > > "When decoding, the UTF-16 charset interprets a byte-order mark to
> > > indicate the byte order of the stream but defaults to big-endian if
> > > there is no byte-order mark; when encoding, it uses big-endian byte
> > > order and writes a big-endian byte-order mark."
> > >
> > >
> > Hello Dmitry,
> >
> > Yes, although Harmony and RI use different byte orders, both write a
> > byte-order mark (U+FEFF), so I think both Harmony and RI are
> > compliant with the specification. So could we regard HARMONY-308 as
> > "not a bug"?
>
>
> I think Harmony's behavior in this case is inconsistent with the Java
> spec, since the spec defines the expected behavior explicitly:
> "when encoding, it uses big-endian byte order and writes a big-endian
> byte-order mark." But Harmony's encode() returns bytes in little-endian
> order.
>
> It seems I do not understand why you think Harmony follows the spec
> correctly in this case. :)
> I am really sorry for my misunderstanding.
>
> From the test case attached to HARMONY-308:
>
> 1) We have a char array that has no byte-order mark:
>    private static final char chars[] = {
>        0x041b,0x0435,0x0442,0x043e,0x0020,0x0432,0x0020,0x0420,0x043e,0x0441,
>        0x0441,0x0438,0x0438};
>
> 2) We have the byte array that we expect encode() to return:
>    private static final byte bytes[] = {
>        (byte)254,(byte)255,(byte)  4,(byte) 27,(byte)  4,(byte) 53,(byte)  4,
>        (byte) 66,(byte)  4,(byte) 62,(byte)  0,(byte) 32,(byte)  4,(byte) 50,
>        (byte)  0,(byte) 32,(byte)  4,(byte) 32,(byte)  4,(byte) 62,(byte)  4,
>        (byte) 65,(byte)  4,(byte) 65,(byte)  4,(byte) 56,(byte)  4,(byte) 56};
>
> Please note, according to the spec we expect bytes returned by encode() in
> big-endian byte order. So, we expect the FEFF byte-order mark.
> Do you agree this expectation is correct and consistent with the spec?
>
> Thanks.
> --
> Dmitry M. Kononov
> Intel Managed Runtime Division
>
>
--
Andrew Zhang
China Software Development Lab, IBM
