lucene-dev mailing list archives

From Doug Cutting <>
Subject Re: Lucene does NOT use UTF-8.
Date Tue, 30 Aug 2005 16:50:28 GMT

> How will the difference impact String memory allocations?  Looking at 
> the String code, I can't see where it would make an impact.

I spoke a bit too soon.  I should have looked at the code first.  You're 
right, I don't think it would require more allocations.

When considering this byte-count versus character-count issue please 
note that it also arises elsewhere.  The PrefixLength in the Term 
Dictionary section of the file format document is currently defined as a 
number of characters, not bytes.

Implementing this in terms of bytes may have performance implications, 
since, at first glance, the entire byte sequence would need to be 
converted from UTF-8 into the internal string representation for each 
term, rather than just the suffix.  Does anyone see a way around that?
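
To make the trade-off concrete, here is a rough Java sketch of the two ways a term could be rebuilt from the previous term plus a stored suffix. It is not Lucene's actual reader code: the class and method names are hypothetical, and it uses the JDK's standard UTF-8 codec rather than Lucene's own encoding routines. With a character-count prefix, only the suffix bytes are decoded. With a byte-count prefix, the naive approach splices bytes and decodes the whole term, because a shared byte prefix may end in the middle of a multi-byte character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names come from Lucene.
public class PrefixLengthSketch {

    // Number of leading Java chars (UTF-16 code units) shared by a and b.
    public static int sharedPrefixChars(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Number of leading bytes shared by the standard UTF-8 encodings of a and b.
    public static int sharedPrefixBytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        int i = 0;
        while (i < n && x[i] == y[i]) {
            i++;
        }
        return i;
    }

    // Prefix counted in chars: reuse the previous term's chars and
    // decode only the suffix bytes.
    public static String readTermCharPrefix(String previous, int prefixChars, byte[] suffixUtf8) {
        return previous.substring(0, prefixChars)
                + new String(suffixUtf8, StandardCharsets.UTF_8);
    }

    // Prefix counted in bytes, done naively: splice prefix and suffix bytes,
    // then decode the entire term, since the prefix may split a character.
    public static String readTermBytePrefix(byte[] previousUtf8, int prefixBytes, byte[] suffixUtf8) {
        byte[] whole = new byte[prefixBytes + suffixUtf8.length];
        System.arraycopy(previousUtf8, 0, whole, 0, prefixBytes);
        System.arraycopy(suffixUtf8, 0, whole, prefixBytes, suffixUtf8.length);
        return new String(whole, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String a = "\u00fcber";   // u-umlaut is two bytes in UTF-8
        String b = "\u00fcbung";
        System.out.println("shared chars: " + sharedPrefixChars(a, b));
        System.out.println("shared bytes: " + sharedPrefixBytes(a, b));
    }
}
```

For these two terms the shared prefix is 2 chars but 3 bytes, so the two definitions of PrefixLength already disagree on simple accented text. Two terms that start with U+00FC and U+00FD share one UTF-8 byte but zero characters, which is why the byte-based reader above cannot simply reuse part of the previous decoded string.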

As for how we got to this point: I wrote Lucene's UTF-8 reading and 
writing code in 1998, back when Unicode still had fewer than 2^16 
characters.  It's surprising that it has lasted this long without anyone 

