lucene-dev mailing list archives

From Steven Rowe
Subject Re: Lucene does NOT use UTF-8
Date Tue, 30 Aug 2005 17:36:39 GMT
DM Smith wrote:
> Daniel Naber wrote:
>> But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to 
>> be the case.
> UTF-16 is a fixed 2 byte/char representation.

Except when it's not.  I.e., above the BMP.

From the Unicode 4.0 standard:

    In the UTF-16 encoding form, code points in the
    range U+0000..U+FFFF are represented as a single
    16-bit code unit; code points in the supplementary
    planes, in the range U+10000..U+10FFFF, are
    instead represented as pairs of 16-bit code units.
    These pairs of special code units are known as
    surrogate pairs.
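
For anyone who wants to see this from Java, here is a minimal sketch, assuming a Java 5 or later runtime (where Character.charCount and Character.toChars are available). The class name SurrogateDemo and the helper utf16Units are made up for illustration and are not Lucene code; U+00E9 and U+1D11E are just sample code points, one inside the BMP and one above it.

```java
public class SurrogateDemo {

    // Hypothetical helper: how many 16-bit code units UTF-16 needs
    // to represent a single Unicode code point (1 or 2).
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int bmp = 0x00E9;    // U+00E9, inside the BMP
        int supp = 0x1D11E;  // U+1D11E, in a supplementary plane

        System.out.println(utf16Units(bmp));   // one code unit
        System.out.println(utf16Units(supp));  // two code units

        // Encode the supplementary code point and show its surrogate pair.
        char[] pair = Character.toChars(supp);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);

        // A String holding that one character reports length() == 2,
        // because length() counts code units, not code points.
        String s = new String(pair);
        System.out.println(s.length() + " code units, "
                + s.codePointCount(0, s.length()) + " code point");
    }
}
```

So a Java char is one UTF-16 code unit, and "2 bytes per character" only holds as long as every character stays within U+0000..U+FFFF.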
