lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Lucene does NOT use UTF-8.
Date Tue, 30 Aug 2005 19:47:54 GMT
Yonik Seeley wrote:
> I've been looking around... do you have a pointer to the source where just 
> the suffix is converted from UTF-8?
> 
> I understand the index format, but I'm not sure I understand the problem 
> that would be posed by the prefix length being a byte count.

TermBuffer.java:66

Things could work fine if the prefix length were a byte count.  A byte 
buffer could easily be constructed that contains the full byte sequence 
(prefix + suffix), and then this could be converted to a String.  The 
inefficiency would be that the prefix gets re-converted from UTF-8 for 
each term, e.g., in order to compare it to the target.  Prefixes are 
frequently longer than suffixes, so this could be significant.  Does 
that make sense?  I don't know whether it would actually be significant, 
but TermBuffer.java was added recently as a measurable performance 
enhancement, so this is performance-critical code.
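
To make the byte-count variant concrete, here is a minimal sketch.  It 
is not the TermBuffer.java code: the class name, the read() and text() 
methods, and the use of StandardCharsets are all assumptions made for 
illustration.  It keeps the previous term's bytes, overwrites 
everything after the shared prefix with the new suffix, and decodes the 
whole sequence, which is where the shared prefix gets re-converted for 
every term.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; not Lucene's TermBuffer.
public class BytePrefixTermSketch {
    // Reusable buffer holding the UTF-8 bytes of the most recent term.
    private byte[] bytes = new byte[16];
    private int length = 0;

    /**
     * Keeps the first prefixLength bytes of the previous term and appends
     * the suffix bytes. prefixLength is a byte count and is assumed to be
     * no larger than the previous term's byte length.
     */
    public void read(int prefixLength, byte[] suffix) {
        int newLength = prefixLength + suffix.length;
        if (newLength > bytes.length) {
            bytes = Arrays.copyOf(bytes, Math.max(newLength, bytes.length * 2));
        }
        System.arraycopy(suffix, 0, bytes, prefixLength, suffix.length);
        length = newLength;
    }

    /** Decodes the full term; the shared prefix is decoded again on every call. */
    public String text() {
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        BytePrefixTermSketch buf = new BytePrefixTermSketch();
        // First term has no shared prefix. It is 5 bytes long because
        // U+00E9 takes 2 bytes in UTF-8.
        buf.read(0, "caf\u00e9".getBytes(StandardCharsets.UTF_8));
        System.out.println(buf.text());
        // Second term shares all 5 bytes and adds a one-byte suffix.
        buf.read(5, "s".getBytes(StandardCharsets.UTF_8));
        System.out.println(buf.text());
    }
}
```

With a character-count prefix, the already-decoded prefix characters 
could be kept as they are and only the suffix decoded, which is the 
cost difference in question.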

We need to stop discussing this in the abstract and start coding 
alternatives and benchmarking them.  Is java.nio.charset.CharsetEncoder 
fast enough?  Will moving things through CharBuffer and ByteBuffer be 
too slow?  Should Lucene keep maintaining its own UTF-8 implementation 
for performance?  I don't know; only experiments will tell.
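
As a starting point for such an experiment, here is a rough sketch that 
times a hand-rolled encoder against java.nio.charset.CharsetEncoder.  
Everything in it is an assumption for illustration: the class and 
method names, the sample term, and the iteration count.  The 
hand-rolled encoder only handles characters in the Basic Multilingual 
Plane (no surrogate pairs), and it is not Lucene's own implementation.  
A loop like this is a crude measurement at best; a harness such as JMH 
would give more trustworthy numbers.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical micro-benchmark sketch; names and sizes are illustrative.
public class Utf8EncodeBenchSketch {

    /** Hand-rolled encoder: BMP characters only, no surrogate-pair handling. */
    public static int encodeByHand(String s, byte[] out) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out[n++] = (byte) c;
            } else if (c < 0x800) {
                out[n++] = (byte) (0xC0 | (c >> 6));
                out[n++] = (byte) (0x80 | (c & 0x3F));
            } else {
                out[n++] = (byte) (0xE0 | (c >> 12));
                out[n++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                out[n++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        return n; // number of bytes written
    }

    /** Same work through CharBuffer and ByteBuffer; returns bytes written. */
    public static int encodeWithNio(String s, CharsetEncoder enc, ByteBuffer out) {
        out.clear();
        enc.reset();
        enc.encode(CharBuffer.wrap(s), out, true);
        enc.flush(out);
        return out.position();
    }

    public static void main(String[] args) {
        String term = "r\u00e9sum\u00e9-\u65e5\u672c\u8a9e";
        byte[] handBuf = new byte[term.length() * 3];
        ByteBuffer nioBuf = ByteBuffer.allocate(term.length() * 3);
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        int iterations = 5_000_000;
        long sink = 0; // accumulated so the JIT cannot discard the work

        for (int round = 0; round < 3; round++) { // early rounds act as warm-up
            long t0 = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                sink += encodeByHand(term, handBuf);
            }
            long t1 = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                sink += encodeWithNio(term, enc, nioBuf);
            }
            long t2 = System.nanoTime();
            System.out.printf("round %d: by hand %d ms, nio %d ms%n",
                    round, (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        }
        System.out.println("sink=" + sink);
    }
}
```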

Doug


