harmony-dev mailing list archives

From "Dmitry M. Kononov" <dmitry.m.kono...@gmail.com>
Subject Re: [jira] Updated: (HARMONY-308) java.nio.charset.Charset.encode(CharBuffer) returns bytes in a different order in Harmony and RI for the UTF-16 charset
Date Fri, 07 Apr 2006 12:44:00 GMT
Hi Andrew,

On 4/7/06, Andrew Zhang <zhanghuangzhu@gmail.com> wrote:
> Hello, Dmitry,
> I agree with you that Harmony's behavior is not consistent with the Java spec.
>
> As you may know, java.nio.charset.Charset wraps ICU to implement
> encode/decode operations.
> The following description is cited from ICU
> (http://icu.sourceforge.net/userguide/unicodeBasics.html):
>
> The names "UTF-16" and "UTF-32" are ambiguous. Depending on context, they
> refer either to character encoding forms where 16/32-bit words are processed
> and are naturally stored in the platform endianness, or they refer to the
> IANA-registered charset names, i.e., to character encoding schemes or byte
> serializations. In addition to simple byte serialization, the charsets with
> these names also use optional Byte Order Marks (see Serialized Formats
> <http://icu.sourceforge.net/userguide/unicodeBasics.html#serialized_formats>
> below).

Thanks, it's a good point. However, I found the following text in the same
document, which suggests that there is a bug in ICU. Please note the
last sentence, which I believe describes our case exactly:

"In UTF-16 and UTF-32, where the signature also distinguishes between
big-endian and little-endian byte orders, it is also called a byte order
mark (BOM). The signature works for UTF-16 since the code point that has the
byte-swapped encoding, 0xFFFE, will never be a valid Unicode character. (It
is a "non-character" code point.) In Internet protocols, if an encoding
specification of "UTF-16" or "UTF-32" is used, it is expected that there is
a signature byte sequence (BOM) that identifies the byte ordering, which is
not the case for the encoding scheme/charset names with "BE" or "LE".
If text is specified to be encoded in the UTF-16 or UTF-32 charset and does
not begin with a BOM, then it must be interpreted as UTF-16BE or UTF-32BE,
respectively."
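
To make the rule in that last sentence concrete, here is a minimal sketch of
how a reader of "UTF-16" data could choose the byte order. It is my own
illustration, not code from ICU or Harmony, and the class and method names
(BomSniffer, schemeFor, decode) are hypothetical. It only relies on the
UTF-16BE and UTF-16LE charsets, which every Java platform must support.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of ICU or Harmony.
class BomSniffer {
    // True if the data starts with FE FF, i.e. U+FEFF serialized big-endian.
    static boolean hasBigEndianBom(byte[] d) {
        return d.length >= 2 && (d[0] & 0xFF) == 0xFE && (d[1] & 0xFF) == 0xFF;
    }

    // True if the data starts with FF FE, i.e. U+FEFF serialized little-endian.
    static boolean hasLittleEndianBom(byte[] d) {
        return d.length >= 2 && (d[0] & 0xFF) == 0xFF && (d[1] & 0xFF) == 0xFE;
    }

    // Little-endian only when the byte-swapped BOM is present;
    // with a big-endian BOM or with no BOM at all, fall back to big-endian.
    static String schemeFor(byte[] d) {
        return hasLittleEndianBom(d) ? "UTF-16LE" : "UTF-16BE";
    }

    // Decodes the data with the detected scheme, skipping the BOM if there is one.
    static String decode(byte[] d) {
        int offset = (hasBigEndianBom(d) || hasLittleEndianBom(d)) ? 2 : 0;
        ByteBuffer body = ByteBuffer.wrap(d, offset, d.length - offset);
        return Charset.forName(schemeFor(d)).decode(body).toString();
    }
}
```

For example, BomSniffer.decode(new byte[] {0x00, 0x41}) has no BOM and is
read as big-endian, so the two bytes decode to the single character 'A'.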

Harmony and IBM jdk1.4.2 both use ICU to provide java.nio.charset
functionality, so they behave the same way in our case. This behavior
does not follow the Java documentation (or I don't understand something :) ).
Thus, we probably need to ask for a fix in ICU, don't we?
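
For anyone who wants to compare implementations, a small probe along these
lines should show the difference. It is only a sketch: the class name
Utf16EncodeProbe and its helper methods are made up for this mail, and it
uses nothing beyond the public java.nio.charset API. It prints the bytes
produced by Charset.encode(CharBuffer) for the three UTF-16 charset names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Hypothetical probe for illustration; the names are not from HARMONY-308.
class Utf16EncodeProbe {
    // Renders the remaining bytes of the buffer as space-separated hex pairs.
    static String hex(ByteBuffer buf) {
        StringBuilder sb = new StringBuilder();
        while (buf.hasRemaining()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", buf.get() & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the text with the named charset and returns the bytes as hex.
    static String encodeHex(String charsetName, String text) {
        Charset cs = Charset.forName(charsetName);
        return hex(cs.encode(CharBuffer.wrap(text)));
    }

    public static void main(String[] args) {
        String[] names = {"UTF-16", "UTF-16BE", "UTF-16LE"};
        for (String name : names) {
            System.out.println(name + ": " + encodeHex(name, "A"));
        }
    }
}
```

According to the Charset javadoc, the "UTF-16" charset encodes in big-endian
order and writes a big-endian byte-order mark, so a conforming implementation
should print FE FF 00 41 for the first line. Any other byte order there is
the discrepancy discussed in this thread.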

What do you think, does it make sense to file a bug against ICU?
Dmitry M. Kononov
Intel Managed Runtime Division
