lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Lucene does NOT use UTF-8.
Date Tue, 30 Aug 2005 16:50:28 GMT
tjones@apache.org wrote:
> How will the difference impact String memory allocations?  Looking at 
> the String code, I can't see where it would make an impact.

I spoke a bit too soon.  I should have looked at the code first.  You're 
right, I don't think it would require more allocations.

When considering this byte-count versus character-count issue please 
note that it also arises elsewhere.  The PrefixLength in the Term 
Dictionary section of the file format document is currently defined as a 
number of characters, not bytes.

http://lucene.apache.org/java/docs/fileformats.html#Term%20Dictionary
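
To make the character-count versus byte-count distinction concrete, here is a minimal sketch that measures the prefix shared by two terms in both units. It is illustration only, not Lucene code: the class and method names are hypothetical, it uses the JDK's standard UTF-8 encoder rather than Lucene's own writer, and "characters" here means Java chars (UTF-16 code units).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene.
final class PrefixUnits {

    /** Shared prefix length counted in Java chars (UTF-16 code units). */
    static int sharedPrefixChars(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    /** Shared prefix length counted in bytes of the standard UTF-8 encoding. */
    static int sharedPrefixBytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int limit = Math.min(x.length, y.length);
        int i = 0;
        while (i < limit && x[i] == y[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String previous = "caf\u00e9";   // ends in e-acute, which takes two bytes in UTF-8
        String current = "caf\u00e9s";
        System.out.println(sharedPrefixChars(previous, current)); // prints 4
        System.out.println(sharedPrefixBytes(previous, current)); // prints 5
    }
}
```

The two counts diverge as soon as a shared character needs more than one byte. A byte-counted prefix can also end in the middle of a character: "caf\u00e9" and "caf\u00e8" share three chars but four bytes, because both accented letters begin with the same lead byte.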

Implementing this in terms of bytes may have performance implications, 
since, at first glance, the entire byte sequence would need to be 
converted from UTF-8 into the internal string representation for each 
term, rather than just the suffix.  Does anyone see a way around that?
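
The concern can be sketched roughly as follows. This is an assumption-laden illustration, not the actual term-reading code: the class, its methods, and the idea of passing the suffix as a ready-made byte array are all hypothetical, and decoding uses the JDK's standard UTF-8 decoder. With a prefix counted in chars, the already-decoded prefix is kept and only the suffix bytes are decoded. With a prefix counted in bytes, the naive approach rebuilds the full byte sequence and decodes all of it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not Lucene's reader.
final class SuffixDecodeSketch {
    // Decoded chars of the previous term, reused by the char-prefix path.
    private final StringBuilder currentChars = new StringBuilder();
    // Encoded bytes of the previous term, reused by the byte-prefix path.
    private byte[] currentBytes = new byte[0];

    /** Prefix counted in chars: keep the decoded prefix, decode only the suffix. */
    String nextTermCharPrefix(int prefixChars, byte[] suffixUtf8) {
        currentChars.setLength(prefixChars);
        currentChars.append(new String(suffixUtf8, StandardCharsets.UTF_8));
        return currentChars.toString();
    }

    /** Prefix counted in bytes, naive version: rebuild and decode the whole term. */
    String nextTermBytePrefix(int prefixBytes, byte[] suffixUtf8) {
        // Assumes prefixBytes <= currentBytes.length.
        byte[] term = Arrays.copyOf(currentBytes, prefixBytes + suffixUtf8.length);
        System.arraycopy(suffixUtf8, 0, term, prefixBytes, suffixUtf8.length);
        currentBytes = term;
        // The suffix alone may start mid-character, so the full sequence is decoded.
        return new String(term, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        SuffixDecodeSketch reader = new SuffixDecodeSketch();
        byte[] first = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(reader.nextTermCharPrefix(0, first).length());
        System.out.println(reader.nextTermBytePrefix(0, first).length());
    }
}
```

The byte-prefix path cannot simply decode the suffix in isolation, because a byte-counted prefix may stop partway through a multi-byte character, leaving a suffix that begins with a continuation byte. Whether a smarter scheme can avoid the full re-decode is exactly the open question above; this sketch only shows the straightforward version.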

As for how we got to this point: I wrote Lucene's UTF-8 reading and 
writing code in 1998, back when Unicode still had fewer than 2^16 
characters.  It's surprising that it has lasted this long without anyone 
noticing!

Doug


