lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Lucene does NOT use UTF-8.
Date Tue, 30 Aug 2005 21:21:10 GMT
Yonik Seeley wrote:
> A related problem exists even if the prefix length vInt is changed to
> represent the number of Unicode chars (as opposed to the number of Java
> chars), right? The prefix length is no longer the offset into the char[]
> at which to put the suffix.

Yes, I suppose this is a problem too.  Sigh.
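
To make the mismatch concrete, here is a minimal sketch. It is not Lucene code: the class and method names are hypothetical, and it assumes well-formed UTF-16 strings. It shows that a shared prefix counted in Unicode code points is not the same number as the char[] offset once a supplementary character is involved, and that String.offsetByCodePoints can convert one into the other.

```java
public class PrefixLengthSketch {

    // Length of the shared prefix, counted in Java chars (UTF-16 code units).
    public static int sharedPrefixChars(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // The same prefix, counted in Unicode code points.
    public static int sharedPrefixCodePoints(String a, String b) {
        return a.codePointCount(0, sharedPrefixChars(a, b));
    }

    // Convert a code point count back into an offset into the char[].
    public static int charOffset(String s, int codePoints) {
        return s.offsetByCodePoints(0, codePoints);
    }

    public static void main(String[] args) {
        // U+1D50A is outside the BMP, so it takes two Java chars.
        String previous = "a\uD835\uDD0Ab";
        String current = "a\uD835\uDD0Ac";

        int chars = sharedPrefixChars(previous, current);
        int codePoints = sharedPrefixCodePoints(previous, current);

        System.out.println("prefix in Java chars:    " + chars);
        System.out.println("prefix in code points:   " + codePoints);
        System.out.println("char[] offset recovered: " + charOffset(previous, codePoints));
    }
}
```

A reader that used the code point count directly as the char[] offset would place the suffix one position too early in this example.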

> Another approach might be to convert the target to a UTF-8 byte[]
> and do all comparisons on byte[]. UTF-8 has some very nice properties,
> including that the byte[] representations of UTF-8 strings compare the
> same as UCS-4 would.

I was not aware of that, but I see you are correct:

    o  The byte-value lexicographic sorting order of UTF-8 strings is the
       same as if ordered by character numbers.

(From http://www.faqs.org/rfcs/rfc3629.html)

That makes the byte representation much more palatable, since Lucene 
orders terms lexicographically.
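
The ordering property can be checked with a small sketch. It is illustrative only, not Lucene code: the class and method names are hypothetical, and it assumes well-formed strings with no unpaired surrogates. It compares two terms three ways: by Java's String.compareTo (UTF-16 code units), by Unicode code point, and by unsigned UTF-8 bytes. For a BMP character above the surrogate range and a supplementary character, the first disagrees with the other two.

```java
import java.nio.charset.StandardCharsets;

public class Utf8OrderSketch {

    // Lexicographic comparison of byte arrays, treating each byte as unsigned.
    public static int compareUnsignedBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Compare two strings by their UTF-8 encodings.
    public static int compareUtf8(String a, String b) {
        return compareUnsignedBytes(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8));
    }

    // Compare two strings code point by code point (the UCS-4 order).
    public static int compareCodePoints(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return ca < cb ? -1 : 1;
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other; the shorter sorts first.
        return (a.length() - i) - (b.length() - j);
    }

    public static void main(String[] args) {
        String bmp = "\uFB01";                 // U+FB01, UTF-8: EF AC 81
        String supplementary = "\uD835\uDD0A"; // U+1D50A, UTF-8: F0 9D 94 8A

        // UTF-16 code unit order: 0xFB01 > 0xD835, so the result is positive.
        System.out.println("String.compareTo:  " + bmp.compareTo(supplementary));
        // Code point order: 0xFB01 < 0x1D50A, so the result is negative.
        System.out.println("code point order:  " + compareCodePoints(bmp, supplementary));
        // UTF-8 byte order: 0xEF < 0xF0, so the result is negative.
        System.out.println("UTF-8 byte order:  " + compareUtf8(bmp, supplementary));
    }
}
```

The bytes must be compared as unsigned values. Java's byte type is signed, so a plain a[i] - b[i] would sort every multi-byte sequence ahead of ASCII.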

Doug




