lucene-dev mailing list archives

From Steven Rowe <sar...@syr.edu>
Subject Re: Lucene does NOT use UTF-8
Date Tue, 30 Aug 2005 17:36:39 GMT
DM Smith wrote:
> Daniel Naber wrote:
>> But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to 
>> be the case.
>>
> UTF-16 is a fixed 2 byte/char representation.

Except when it's not, i.e., for code points above the BMP (Basic
Multilingual Plane, U+0000..U+FFFF), which take two 16-bit code units each.

From the Unicode 4.0 standard
<http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf>:

    In the UTF-16 encoding form, code points in the
    range U+0000..U+FFFF are represented as a single
    16-bit code unit; code points in the supplementary
    planes, in the range U+10000..U+10FFFF, are
    instead represented as pairs of 16-bit code units.
    These pairs of special code units are known as
    surrogate pairs.
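
To make the rule in that passage concrete, here is a minimal sketch in Java of how a code point maps to one or two 16-bit code units. The class name Utf16Demo and the helper encode() are hypothetical, made up for this illustration; this is not Lucene code, and it does not treat lone surrogate code points specially.

```java
public class Utf16Demo {
    // Hypothetical helper: encode one Unicode code point as UTF-16 code units,
    // following the rule quoted from the Unicode standard above.
    static char[] encode(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        if (codePoint <= 0xFFFF) {
            // BMP: a single 16-bit code unit.
            return new char[] { (char) codePoint };
        }
        // Supplementary planes: subtract 0x10000, then split the remaining
        // 20 bits into a high (lead) and a low (trail) surrogate.
        int offset = codePoint - 0x10000;
        char high = (char) (0xD800 + (offset >> 10));
        char low = (char) (0xDC00 + (offset & 0x3FF));
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        // U+0041 LATIN CAPITAL LETTER A is in the BMP.
        System.out.println(encode(0x41).length);
        // U+1D11E MUSICAL SYMBOL G CLEF is above the BMP.
        char[] clef = encode(0x1D11E);
        System.out.println(clef.length);
        System.out.println(Integer.toHexString(clef[0]) + " " + Integer.toHexString(clef[1]));
    }
}
```

Since Java 5, Character.toChars(int) performs the same conversion, and String.length() counts code units rather than code points, so a string holding only U+1D11E has length 2.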
