From Ken Krugler
Subject Re: Lucene does NOT use UTF-8
Date Mon, 29 Aug 2005 17:56:46 GMT
>On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
>>>I'm not familiar with UTF-8 enough to follow the details of this
>>>discussion.  I hope other Lucene developers are, so we can resolve this
>>>issue.... anyone raising a hand?
>>I could, but recent posts make me think this is heading towards a
>>religious debate :)
>Ken - you mentioned taking the discussion off-line in a previous 
>post.  Please don't.  Let's keep it alive on java-dev until we have 
>a resolution to it.
>>I think the following statements are all true:
>>a. Using UTF-8 for strings would make it easier for Lucene indexes 
>>to be used by other implementations besides the reference Java
>>implementation.
>>b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.
>What, if any, performance impact would changing Java Lucene in this 
>regard have?   (I realize this is rhetorical at this point, until a 
>solution is at hand)

Almost zero. A tiny hit when reading/writing surrogate pairs, to 
properly encode them as a 4-byte UTF-8 sequence versus two 3-byte
sequences.
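
To make the size difference concrete, here is a minimal sketch that encodes the same surrogate pair both ways. It is not Lucene code: the class and method names are made up for illustration, and the JDK's DataOutputStream.writeUTF is used only as a stand-in for "Java modified UTF-8" (its 2-byte length prefix is stripped before comparing).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
class SurrogateEncodingDemo {

    // Standard UTF-8: a supplementary character (one surrogate pair)
    // becomes a single 4-byte sequence.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Java "modified UTF-8" as produced by DataOutputStream.writeUTF:
    // each surrogate is encoded on its own as a 3-byte sequence.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        // Drop the 2-byte length prefix that writeUTF emits.
        return Arrays.copyOfRange(all, 2, all.length);
    }
}
```

For U+10400, written in Java source as "\uD801\uDC00", standardUtf8 returns the 4 bytes F0 90 90 80, while modifiedUtf8 returns the 6 bytes ED A0 81 ED B0 80.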

>>c. The hard(er) part would be backwards compatibility with older 
>>indexes. I haven't looked at this enough to really know, but one 
>>example is the compound file (xx.cfs) format...I didn't see a 
>>version number, and it contains strings.
>I don't know the gory details, but we've made compatibility breaking 
>changes in the past and the current version of Lucene can open older 
>formats, but only write the most current format.  I suspect it could 
>be made to be backwards compatible.  Worst case, we break 
>compatibility in 2.0.

Ronald is correct in that it would be easy to make the reader handle 
both "Java modified UTF-8" and UTF-8, and the writer always output 
UTF-8. So the only problem would be if older versions of Lucene (or 
maybe CLucene) wound up trying to read strings that contained 4-byte 
UTF-8 sequences, as they wouldn't know how to convert this into two 
UTF-16 Java chars.
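
A reader that accepts both forms could look roughly like the sketch below. This is an assumption-laden illustration, not the actual Lucene reader: the class name is hypothetical, it decodes a whole byte array rather than a length-prefixed stream, and it assumes well-formed input (truncated data would simply run off the end of the array).

```java
// Hypothetical lenient decoder, for illustration only.
class LenientUtf8Reader {

    // Decodes bytes that may contain standard UTF-8 (4-byte sequences for
    // supplementary characters) or Java modified UTF-8 (each surrogate as
    // its own 3-byte sequence, NUL as C0 80), or a mix of the two.
    static String decode(byte[] b) {
        StringBuilder sb = new StringBuilder(b.length);
        int i = 0;
        while (i < b.length) {
            int b0 = b[i++] & 0xFF;
            if (b0 < 0x80) {
                // 1 byte: ASCII.
                sb.append((char) b0);
            } else if ((b0 & 0xE0) == 0xC0) {
                // 2 bytes; this also covers the modified UTF-8 NUL (C0 80).
                sb.append((char) (((b0 & 0x1F) << 6) | (b[i++] & 0x3F)));
            } else if ((b0 & 0xF0) == 0xE0) {
                // 3 bytes; a lone surrogate encoded this way is passed
                // through unchanged, so a pair of them forms a valid pair.
                sb.append((char) (((b0 & 0x0F) << 12)
                        | ((b[i++] & 0x3F) << 6)
                        | (b[i++] & 0x3F)));
            } else if ((b0 & 0xF8) == 0xF0) {
                // 4 bytes: a supplementary code point, which
                // appendCodePoint expands into two UTF-16 chars.
                int cp = ((b0 & 0x07) << 18)
                        | ((b[i++] & 0x3F) << 12)
                        | ((b[i++] & 0x3F) << 6)
                        | (b[i++] & 0x3F);
                sb.appendCodePoint(cp);
            } else {
                throw new IllegalArgumentException(
                        "Unexpected lead byte 0x" + Integer.toHexString(b0));
            }
        }
        return sb.toString();
    }
}
```

With this, both F0 90 90 80 and ED A0 81 ED B0 80 decode to the same two Java chars, \uD801\uDC00. An older reader that lacks the 4-byte branch is the case that would fail.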

Since 4-byte UTF-8 sequences are only for characters outside of the 
BMP, and these are rare, it seems like an OK thing to do, but that's 
just my uninformed view.

>>d. The documentation could be clearer on what is meant by the 
>>"string length", but this is a trivial change.
>That change was made by Daniel soon after this discussion began.

Daniel changed the definition of Chars, but the String section still
needs to be clarified. Currently it says:

"Lucene writes strings as a VInt representing the length, followed by 
the character data".

It should read:

"Lucene writes strings as a VInt representing the length of the 
string in Java chars (UTF-16 code units), followed by the character
data".
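
To show why the wording matters, the sketch below computes the three numbers that "string length" could plausibly mean, plus a VInt encoding of the length as I understand it from the file format documentation (7 bits per byte, low-order bits first, high bit set when more bytes follow). The class and method names are hypothetical and this is not the Lucene implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class StringLengthDemo {

    // Length in Java chars (UTF-16 code units): the count the proposed
    // wording says is written as the string's length prefix.
    static int utf16CodeUnits(String s) {
        return s.length();
    }

    // Length in Unicode code points: a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Length in bytes when encoded as standard UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Sketch of a VInt encoder for a non-negative value: 7 data bits per
    // byte, least significant group first, high bit marks continuation.
    static byte[] vInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }
}
```

For the string "a\uD801\uDC00" the three counts are 3 chars, 2 code points, and 5 UTF-8 bytes, so a reader that guesses the wrong unit will misparse everything that follows the string.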

-- Ken
Ken Krugler
TransPac Software, Inc.
+1 530-470-9200

