lucene-dev mailing list archives

From Ken Krugler
Subject Re: Lucene does NOT use UTF-8.
Date Sat, 27 Aug 2005 21:03:22 GMT
>On Aug 26, 2005, at 10:14 PM, jian chen wrote:
>>It seems to me that in theory, Lucene storage code could use true UTF-8 to
>>store terms. Maybe it is just a legacy issue that the modified UTF-8 is

The use of 0xC0 0x80 to encode a U+0000 Unicode code point is an 
aspect of Java serialization of character streams. Java uses what 
they call "a modified version of UTF-8", though that's a really bad 
way to describe it. It's a different Unicode encoding, one that 
resembles UTF-8, but that's it.
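
To make the difference concrete, here is a minimal Java sketch. It uses the JDK's `DataOutputStream.writeUTF`, which is documented to write "modified UTF-8", as a stand-in; it is not Lucene's own string-writing code. The class and method names (`ModifiedUtf8Demo`, `modifiedUtf8`, `hex`) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Lucene.
final class ModifiedUtf8Demo {
    // Returns the bytes writeUTF emits for s, minus its 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = String.valueOf((char) 0);  // the single code point U+0000
        System.out.println(hex(modifiedUtf8(nul)));                     // C0 80
        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8)));  // 00

        // A code point above the BMP: each surrogate is encoded separately.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(modifiedUtf8(clef).length);                     // 6
        System.out.println(clef.getBytes(StandardCharsets.UTF_8).length);  // 4
    }
}
```

The second pair of lines shows the other place the two encodings part ways: a supplementary code point takes four bytes in UTF-8 but six in Java's variant.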

>It's not a matter of a simple switch.  The VInt count at the head of 
>a Lucene string is not the number of Unicode code points the string 
>contains.  It's the number of Java chars necessary to contain that 
>string.  Code points above the BMP require 2 java chars, since they 
>must be represented by surrogate pairs.  The same code point must be 
>represented by one character in legal UTF-8.
>If Plucene counts the number of legal UTF-8 characters and assigns 
>that number as the VInt at the front of a string, when Java Lucene 
>decodes the string it will allocate an array of char which is too 
>small to hold the string.
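
The mismatch described in the quoted text can be seen directly in Java. This is only an illustrative sketch; `CharCountDemo` and its methods are hypothetical names, and U+1D11E is just a convenient code point above the BMP.

```java
// Hypothetical demo class; not part of Lucene.
final class CharCountDemo {
    // Number of Java chars (UTF-16 code units) in s.
    static int javaChars(String s) {
        return s.length();
    }

    // Number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a" followed by U+1D11E, which needs a surrogate pair in UTF-16.
        String s = "a" + new String(Character.toChars(0x1D11E));
        System.out.println(javaChars(s));   // 3
        System.out.println(codePoints(s));  // 2
    }
}
```

A writer that stores 2 as the count for this string would lead a reader that allocates `new char[2]` to come up one char short.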

I think Jian was proposing that Lucene switch to using a true UTF-8 
encoding, which would make things a bit cleaner. And probably easier 
than changing all references to CESU-8 :)

And yes, given that the integer count is the number of UTF-16 code 
units required to represent the string, your code will need to do a 
bit more processing when calculating the character count, but that's 
a one-liner, right?
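
For what it's worth, here is roughly what that extra processing looks like, sketched in Java for illustration (the Plucene side would of course be Perl). It assumes the input is well-formed UTF-8; `Utf16Length` and `utf16Units` are made-up names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Lucene or Plucene.
final class Utf16Length {
    // Counts the UTF-16 code units needed for a well-formed UTF-8 byte sequence.
    static int utf16Units(byte[] utf8) {
        int units = 0;
        for (byte b : utf8) {
            int v = b & 0xFF;
            if ((v & 0xC0) != 0x80) {
                units++;  // not a continuation byte: starts a new code point
            }
            if (v >= 0xF0) {
                units++;  // lead byte of a 4-byte sequence: needs a surrogate pair
            }
        }
        return units;
    }

    public static void main(String[] args) {
        // 'a', U+1D11E (above the BMP), U+00E9: three code points, four Java chars.
        String s = "a" + new String(Character.toChars(0x1D11E)) + (char) 0xE9;
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf16Units(utf8));  // 4
        System.out.println(s.length());        // 4
    }
}
```

In other words, the count is the number of code points plus one more for each code point above U+FFFF.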

-- Ken
Ken Krugler
TransPac Software, Inc.
+1 530-470-9200
