lucene-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Lucene does NOT use UTF-8.
Date Tue, 30 Aug 2005 22:33:21 GMT
>Yonik Seeley wrote:
>>A related problem exists even if the prefix length vInt is changed 
>>to represent the number of unicode chars (as opposed to number of 
>>java chars), right? The prefix length is no longer the offset into 
>>the char[] to put the suffix.
>
>Yes, I suppose this is a problem too.  Sigh.
>
>>Another approach might be to convert the target to a UTF-8 byte[] 
>>and do all comparisons on byte[]. UTF-8 has some very nice 
>>properties, including that the byte[] representation of UTF-8 
>>strings compare the same as UCS-4 would.
>
>I was not aware of that, but I see you are correct:
>
>    o  The byte-value lexicographic sorting order of UTF-8 strings is the
>       same as if ordered by character numbers.
>
>(From http://www.faqs.org/rfcs/rfc3629.html)
>
>That makes the byte representation much more palatable, since Lucene 
>orders terms lexicographically.

Where/how is the Lucene ordering of terms used?

I'm asking because people often confuse lexicographic order with 
"dictionary" order, whereas in the context of UTF-8 it just means 
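"the same order as Unicode code points". The order of Java chars 
(UTF-16 code units) is also the same as for Unicode code points, 
except when non-BMP characters are involved: their surrogates 
(U+D800 to U+DFFF) sort below BMP characters in the range U+E000 to 
U+FFFF.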
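
To make the non-BMP exception concrete, here is a rough sketch (not Lucene code; the class and method names are made up for illustration) that compares two well-formed strings three ways: by Java chars via String.compareTo, by Unicode code point, and by unsigned UTF-8 bytes. It assumes the strings contain no unpaired surrogates.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene.
public class TermOrderDemo {

    // Order by UTF-16 code units: this is what String.compareTo does.
    public static int compareUtf16(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    // Order by Unicode code point, stepping over surrogate pairs.
    public static int compareCodePoints(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return ca < cb ? -1 : 1;
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the longer one sorts later.
        return Integer.signum((a.length() - i) - (b.length() - j));
    }

    // Order by UTF-8 bytes, treating each byte as unsigned.
    public static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int k = 0; k < n; k++) {
            int bx = x[k] & 0xFF;
            int by = y[k] & 0xFF;
            if (bx != by) {
                return bx < by ? -1 : 1;
            }
        }
        return Integer.signum(x.length - y.length);
    }

    public static void main(String[] args) {
        String bmp = "\uFF21";          // U+FF21, a BMP character above the surrogate range
        String nonBmp = "\uD800\uDC00"; // U+10000, stored as a surrogate pair

        System.out.println(compareUtf16(bmp, nonBmp));      // 1: char order puts U+FF21 last
        System.out.println(compareCodePoints(bmp, nonBmp)); // -1: code point order puts U+FF21 first
        System.out.println(compareUtf8(bmp, nonBmp));       // -1: UTF-8 byte order agrees with code points
    }
}
```

For strings made up only of BMP characters, all three comparisons agree; they diverge only when a supplementary character is compared against a BMP character in U+E000 to U+FFFF.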

Thanks,

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200


