lucene-dev mailing list archives

From Ken Krugler
Subject Re: Lucene does NOT use UTF-8.
Date Tue, 30 Aug 2005 22:33:21 GMT
>Yonik Seeley wrote:
>>A related problem exists even if the prefix length vInt is changed 
>>to represent the number of unicode chars (as opposed to number of 
>>java chars), right? The prefix length is no longer the offset into 
>>the char[] to put the suffix.
>Yes, I suppose this is a problem too.  Sigh.
>>Another approach might be to convert the target to a UTF-8 byte[] 
>>and do all comparisons on byte[]. UTF-8 has some very nice 
>>properties, including that the byte[] representation of UTF-8 
>>strings compare the same as UCS-4 would.
>I was not aware of that, but I see you are correct:
>    o  The byte-value lexicographic sorting order of UTF-8 strings is the
>       same as if ordered by character numbers.
>That makes the byte representation much more palatable, since Lucene 
>orders terms lexicographically.

Where/how is the Lucene ordering of terms used?

I'm asking because people often confuse lexicographic order with 
"dictionary" order, whereas in the context of UTF-8 it just means 
"the same order as Unicode code points". And the order of Java chars 
would be the same as for Unicode code points, other than non-BMP 
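
To make the ordering question concrete, here is a minimal Java sketch comparing one BMP character (U+FF5E) with one non-BMP character (U+10000) under three orderings: UTF-8 bytes, Unicode code points, and Java chars as used by String.compareTo. The class and method names (TermOrderDemo, compareUtf8, compareCodePoints) are made up for illustration and are not Lucene APIs; this only sketches the property under discussion, not how Lucene compares terms.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from Lucene.
class TermOrderDemo {

    // Lexicographic comparison of the UTF-8 encodings, treating bytes as unsigned.
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    // Lexicographic comparison by Unicode code point (what UCS-4 order would give).
    static int compareCodePoints(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // Whichever string still has code points left sorts later.
        return (a.length() - i) - (b.length() - j);
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";                                // U+FF5E, one Java char
        String nonBmp = new String(Character.toChars(0x10000)); // U+10000, surrogate pair D800 DC00

        System.out.println(Integer.signum(compareUtf8(bmp, nonBmp)));       // -1
        System.out.println(Integer.signum(compareCodePoints(bmp, nonBmp))); // -1
        System.out.println(Integer.signum(bmp.compareTo(nonBmp)));          // 1
    }
}
```

The UTF-8 and code point orderings agree, while the Java char ordering puts the non-BMP character first, because its leading surrogate (0xD800) is numerically smaller than 0xFF5E.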


-- Ken
Ken Krugler
TransPac Software, Inc.
+1 530-470-9200
