lucene-dev mailing list archives

From Wolfgang Hoschek
Subject Re: Lucene does NOT use UTF-8.
Date Wed, 31 Aug 2005 07:25:46 GMT
On Aug 30, 2005, at 12:47 PM, Doug Cutting wrote:

> Yonik Seeley wrote:
>> I've been looking around... do you have a pointer to the source  
>> where just the suffix is converted from UTF-8?
>> I understand the index format, but I'm not sure I understand the  
>> problem that would be posed by the prefix length being a byte count.
> Things could work fine if the prefix length were a byte count.  A  
> byte buffer could easily be constructed that contains the full byte  
> sequence (prefix + suffix), and then this could be converted to a  
> String.  The inefficiency would be if prefix were re-converted from  
> UTF-8 for each term, e.g., in order to compare it to the target.   
> Prefixes are frequently longer than suffixes, so this could be  
> significant.  Does that make sense?  I don't know whether it would  
> actually be significant, although was added  
> recently as a measurable performance enhancement, so this is  
> performance critical code.
> We need to stop discussing this in the abstract and start coding  
> alternatives and benchmarking them.  Is  
> java.nio.charset.CharsetEncoder fast enough?  Will moving things  
> through CharBuffer and ByteBuffer be too slow?  Should Lucene keep  
> maintaining its own UTF-8 implementation for performance?  I don't  
> know, only some experiments will tell.
> Doug
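
[Editor's sketch, not part of the original thread.] The byte-count prefix idea Doug describes might look roughly like the following. The class and method names (PrefixBytesSketch, next) are hypothetical and are not Lucene APIs; the sketch assumes each term arrives as a prefix length in bytes plus the UTF-8 bytes of its suffix. Note that it decodes the full term, prefix included, on every call, which is exactly the potential inefficiency raised above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical reader that rebuilds terms from (prefix byte count, suffix bytes). */
public class PrefixBytesSketch {
    // Reusable buffer holding the UTF-8 bytes of the most recently read term.
    private byte[] termBytes = new byte[16];
    private int termLength = 0;

    /**
     * Keeps the first prefixLength bytes of the previous term, appends the
     * suffix bytes, and decodes the whole sequence into a String.
     */
    public String next(int prefixLength, byte[] suffix) {
        if (prefixLength < 0 || prefixLength > termLength) {
            throw new IllegalArgumentException("bad prefix length: " + prefixLength);
        }
        int newLength = prefixLength + suffix.length;
        if (newLength > termBytes.length) {
            // Grow the buffer; the shared prefix bytes are preserved by copyOf.
            termBytes = Arrays.copyOf(termBytes, Math.max(newLength, termBytes.length * 2));
        }
        System.arraycopy(suffix, 0, termBytes, prefixLength, suffix.length);
        termLength = newLength;
        // The prefix is re-decoded here for every term: the cost under discussion.
        return new String(termBytes, 0, termLength, StandardCharsets.UTF_8);
    }
}
```

Because the prefix is counted in bytes, a shared prefix that ends in a multi-byte character is carried over intact without any character-level bookkeeping.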

I don't know whether this matters for Lucene usage. But if
CharsetEncoder/CharBuffer/ByteBuffer turns out to be a significant
problem, it's probably due to the startup/init cost of these classes
when converting many small strings individually, not inherently due
to UTF-8 usage. I'm confident that a custom UTF-8 implementation can
almost completely eliminate these issues. I've done this before for
binary XML with great success, and it could certainly be done for
Lucene just as well. Bottom line: it's probably an issue that can be
dealt with via a proper implementation; it probably shouldn't dictate
design directions.
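
[Editor's sketch, not part of the original thread.] To make the idea of a custom UTF-8 implementation concrete, here is a minimal hand-rolled encoder that writes into a reusable byte array instead of going through CharsetEncoder/CharBuffer/ByteBuffer per string. It is not the code referred to above and not Lucene's implementation; the class name Utf8Sketch and its methods are hypothetical. It emits standard UTF-8 (4-byte sequences for surrogate pairs) and, as an assumption of this sketch, writes '?' for an unpaired surrogate. Whether it is actually faster would have to be benchmarked.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical hand-rolled UTF-8 encoder that reuses one output buffer. */
public class Utf8Sketch {
    private byte[] buf = new byte[64];

    /** The internal buffer; only the first encode(...) bytes are valid. */
    public byte[] buffer() {
        return buf;
    }

    /** Encodes s as standard UTF-8 into the internal buffer; returns the byte count. */
    public int encode(CharSequence s) {
        int len = s.length();
        // Worst case is 3 bytes per UTF-16 char (a surrogate pair is 2 chars, 4 bytes).
        if (buf.length < len * 3) {
            buf = new byte[len * 3];
        }
        int n = 0;
        for (int i = 0; i < len; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                buf[n++] = (byte) c;
            } else if (c < 0x800) {
                buf[n++] = (byte) (0xC0 | (c >> 6));
                buf[n++] = (byte) (0x80 | (c & 0x3F));
            } else if (Character.isHighSurrogate(c) && i + 1 < len
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Combine the pair into one code point and emit 4 bytes.
                int cp = Character.toCodePoint(c, s.charAt(++i));
                buf[n++] = (byte) (0xF0 | (cp >> 18));
                buf[n++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                buf[n++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                buf[n++] = (byte) (0x80 | (cp & 0x3F));
            } else if (Character.isSurrogate(c)) {
                buf[n++] = (byte) '?'; // unpaired surrogate (sketch assumption)
            } else {
                buf[n++] = (byte) (0xE0 | (c >> 12));
                buf[n++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                buf[n++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        return n;
    }
}
```

A caller would keep one Utf8Sketch instance per thread and read the first n bytes of buffer() after each encode call, so no per-string encoder or buffer objects are allocated.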

