lucene-dev mailing list archives

From Ken Krugler
Subject Re: Lucene does NOT use UTF-8
Date Mon, 29 Aug 2005 17:56:41 GMT

>The surrogate pair problem is another matter entirely. First of all, 
>lets see if I do understand the problem correctly: Some unicode 
>characters can be represented by one codepoint outside the BMP (i. 
>e., not with 16 bits) and alternatively with two codepoints, both of 
>them in the 16-bit range.

A Unicode character has a code point, which is a scalar value in the 
range U+0000 to U+10FFFF. The code point for every character in the 
Unicode character set will fall in this range.

There are Unicode encoding schemes, which specify how Unicode code 
point values are serialized. Examples include UTF-8, UTF-16LE, 
UTF-16BE, UTF-32, UTF-7, etc.

The UTF-16 (big or little endian) encoding scheme uses two code units 
(16-bit values) to encode Unicode characters with code point values > 

>According to Marvin's explanations, the Unicode standard requires 
>these characters to be represented as "the one" codepoint in UTF-8, 
>resulting in a 4-, 5-, or 6-byte encoding for that character.

Since the Unicode code point range is constrained to 
U+0000...U+10FFFF, the longest valid UTF-8 sequence is 4 bytes.
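To make the 4-byte limit concrete, here is a minimal sketch of how the UTF-8 length follows from the code point value. The class and method names (Utf8Length, utf8Length) are made up for illustration and are not part of Lucene or the JDK.

```java
// Hypothetical helper: number of UTF-8 bytes needed for one code point.
class Utf8Length {
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("outside the Unicode code point range");
        }
        if (codePoint <= 0x7F) return 1;    // 7 payload bits
        if (codePoint <= 0x7FF) return 2;   // 11 payload bits
        if (codePoint <= 0xFFFF) return 3;  // 16 payload bits
        return 4;                           // 21 payload bits, enough for U+10FFFF
    }
}
```

The 5- and 6-byte forms would only be needed for values above U+10FFFF, which the code point range excludes.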

>But since a Java char _is_ 16 bit, the codepoints beyond the 16-bit 
>range cannot be represented as chars.  That is, the 
>in-memory-representation still requires the use of the surrogate 
>pairs.  Therefore, writing consists of translating the surrogate 
>pair to the >16bit representation of the same character and then 
>algorithmically encoding that.  Reading is exactly the reverse 

Yes. Writing requires that you combine the two surrogate characters 
into a single Unicode code point, then convert that value into the 
4-byte UTF-8 sequence.
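A rough sketch of that write path, assuming Java 5 or later so that Character.isSurrogatePair and Character.toCodePoint are available. The class and method names (Utf8Supplementary, encodePair) are hypothetical, and this is not Lucene's actual writeChars code.

```java
// Hypothetical helper: encode one surrogate pair as a 4-byte UTF-8 sequence.
class Utf8Supplementary {
    static byte[] encodePair(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        // Combine the pair into one code point in U+10000...U+10FFFF.
        int cp = Character.toCodePoint(high, low);
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),           // lead byte: top 3 bits
            (byte) (0x80 | ((cp >> 12) & 0x3F)),  // continuation: next 6 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)),   // continuation: next 6 bits
            (byte) (0x80 | (cp & 0x3F))           // continuation: low 6 bits
        };
    }
}
```

For example, the pair U+D801 U+DC00 (code point U+10400) comes out as the bytes F0 90 90 80. Reading goes the other way: decode the 4 bytes to a code point, then split it back into two chars.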

>Adding code to handle the 4 to 6 byte encodings to the 
>readChars/writeChars method is simple, but how do you do the mapping 
>from surrogate pairs to the chars they represent? Is there an 
>algorithm for doing that except for table lookups or huge switch 

It's easy, since U+D800...U+DBFF is defined as the range for the high 
(most significant) surrogate, and U+DC00...U+DFFF is defined as the 
range for the low (least significant) surrogate.
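Because both ranges are contiguous and each holds 1024 values, the mapping is plain arithmetic with no tables or switch statements. A minimal sketch follows; the names SurrogateMath, combine and split are made up for illustration.

```java
// Hypothetical helper: map between surrogate pairs and supplementary code points.
class SurrogateMath {
    static final int HIGH_START = 0xD800, HIGH_END = 0xDBFF;
    static final int LOW_START = 0xDC00, LOW_END = 0xDFFF;
    static final int SUPPLEMENTARY_START = 0x10000;

    // Surrogate pair -> code point: each surrogate contributes 10 bits.
    static int combine(char high, char low) {
        if (high < HIGH_START || high > HIGH_END || low < LOW_START || low > LOW_END) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return SUPPLEMENTARY_START + ((high - HIGH_START) << 10) + (low - LOW_START);
    }

    // Code point -> surrogate pair: the reverse, used when reading.
    static char[] split(int codePoint) {
        if (codePoint < SUPPLEMENTARY_START || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - SUPPLEMENTARY_START;  // 20-bit value
        return new char[] {
            (char) (HIGH_START + (offset >> 10)),      // upper 10 bits
            (char) (LOW_START + (offset & 0x3FF))      // lower 10 bits
        };
    }
}
```

On Java 5 and later, Character.toCodePoint and Character.toChars do the same job, so hand-rolled arithmetic like this is mainly useful on older JDKs.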

-- Ken
Ken Krugler
TransPac Software, Inc.
+1 530-470-9200
