lucene-dev mailing list archives

From Doug Cutting <>
Subject Re: Lucene does NOT use UTF-8.
Date Wed, 31 Aug 2005 17:04:35 GMT
Wolfgang Hoschek wrote:
> I don't know if it matters for Lucene usage. But if using
> CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a
> significant problem, it's probably due to startup/init time of these
> methods for individually converting many small strings, not inherently
> due to UTF-8 usage. I'm confident that a custom UTF-8 implementation
> can almost completely eliminate these issues. I've done this before for
> binary XML with great success, and it could certainly be done for
> Lucene just as well. Bottom line: It's probably an issue that can be
> dealt with via a proper impl; it probably shouldn't dictate design
> directions.

Good point.  Currently Lucene already has its own (buggy) UTF-8 
implementation for performance, so that wouldn't really be a big change.
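For illustration, a hand-rolled encoder of the sort Wolfgang describes might look like the sketch below: a single static method with none of the CharsetEncoder/ByteBuffer setup cost, which matters when converting many small strings. This is a hypothetical sketch, not Lucene's actual code; it emits standard UTF-8, including the 4-byte form for supplementary characters that a corrected implementation would need.

```java
public class Utf8Sketch {
    /**
     * Encodes s as standard UTF-8 into out starting at offset and returns
     * the number of bytes written. The caller must size out for the worst
     * case (3 bytes per Java char is always sufficient).
     */
    public static int encode(String s, byte[] out, int offset) {
        int p = offset;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {                       // 1-byte form (ASCII)
                out[p++] = (byte) c;
            } else if (c < 0x800) {               // 2-byte form
                out[p++] = (byte) (0xC0 | (c >> 6));
                out[p++] = (byte) (0x80 | (c & 0x3F));
            } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
                       && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Surrogate pair: encode the supplementary code point in
                // the standard 4-byte form, not as two 3-byte sequences.
                int cp = Character.toCodePoint(c, s.charAt(++i));
                out[p++] = (byte) (0xF0 | (cp >> 18));
                out[p++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                out[p++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                out[p++] = (byte) (0x80 | (cp & 0x3F));
            } else {                              // 3-byte form (rest of BMP)
                out[p++] = (byte) (0xE0 | (c >> 12));
                out[p++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                out[p++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        return p - offset;
    }
}
```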

The big question now seems to be whether the stored character sequence 
lengths should be in bytes or characters.  Bytes might be fast and 
simple (whether we implement our own UTF-8 in Java or not) but are not 
back-compatible.  So do we bite the bullet and make a very incompatible 
change to index formats?  Or do we make these counts be Unicode 
characters (which is mostly back-compatible) and make the code a bit 
more awkward?  It would be nice to see some implementations, just to 
see how awkward things actually get.

