lucene-dev mailing list archives

From Yonik Seeley <ysee...@gmail.com>
Subject Re: Lucene does NOT use UTF-8.
Date Tue, 30 Aug 2005 20:52:21 GMT
> The inefficiency would be if prefix were re-converted from UTF-8
> for each term, e.g., in order to compare it to the target.

Ahhh, gotcha.

A related problem exists even if the prefix length vInt is changed to 
represent the number of Unicode characters (as opposed to the number of Java 
chars), right? The prefix length would then no longer be the offset into the 
char[] at which to put the suffix.
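
To make the mismatch concrete, here is a minimal sketch (not Lucene code; the class and method names are made up for illustration). It assumes the prefix length is stored as a count of Unicode code points and shows that, once a term contains a supplementary character, that count has to be converted before it can be used as a char[] offset:

```java
// Hypothetical helper for illustration only; nothing here is Lucene's API.
final class PrefixOffsets {

    /**
     * Converts a prefix length counted in Unicode code points into the
     * matching offset in Java chars (UTF-16 code units) of previousTerm.
     */
    static int charOffsetForCodePointPrefix(String previousTerm, int prefixCodePoints) {
        return previousTerm.offsetByCodePoints(0, prefixCodePoints);
    }

    public static void main(String[] args) {
        // 'a', then U+1D11E (one code point, two Java chars), then 'b'.
        String term = "a" + new String(Character.toChars(0x1D11E)) + "b";

        System.out.println(term.length());                          // 4 Java chars
        System.out.println(term.codePointCount(0, term.length()));  // 3 code points
        // A shared prefix of 2 code points ends at char offset 3, not 2.
        System.out.println(charOffsetForCodePointPrefix(term, 2));  // 3
    }
}
```

The conversion walks the string, so doing it per term would add some cost; whether that matters is one of the things that would need measuring.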

Another approach might be to convert the target to a UTF-8 byte[] 
and do all comparisons on byte[]. UTF-8 has some very nice properties, 
including that UTF-8 strings compared byte by byte (as unsigned bytes) sort 
in the same order as their UCS-4 representations would.
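
A rough sketch of what comparing on byte[] could look like (names are hypothetical, not Lucene code). The one thing to watch is that Java bytes are signed, so each byte has to be masked to 0..255 before comparing. The example also shows a pair of strings where Java's own String.compareTo, which compares UTF-16 code units, disagrees with code point order while the UTF-8 bytes agree with it:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; nothing here is Lucene's API.
final class Utf8Order {

    /** Lexicographic comparison that treats every byte as unsigned (0..255). */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // All shared bytes are equal: the shorter array sorts first.
        return a.length - b.length;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String bmp = new String(Character.toChars(0xFF5E));    // one Java char
        String supp = new String(Character.toChars(0x10000));  // surrogate pair

        // Code point order is U+FF5E < U+10000.
        System.out.println(compareUnsigned(utf8(bmp), utf8(supp)) < 0); // true
        // UTF-16 code unit order puts the surrogate pair first instead.
        System.out.println(bmp.compareTo(supp) < 0);                    // false
    }
}
```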

As you say, the variations need to be tested.

-Yonik 
Now hiring -- http://tinyurl.com/7m67g
