lucene-dev mailing list archives

From Yonik Seeley
Subject Re: Lucene does NOT use UTF-8.
Date Tue, 30 Aug 2005 16:25:17 GMT
> How will the difference impact String memory allocations? Looking at the
> String code, I can't see where it would make an impact.

This is from Lucene InputStream:
public final String readString() throws IOException {
  int length = readVInt();               // length prefix counts chars, not bytes
  if (chars == null || length > chars.length)
    chars = new char[length];            // grow the reusable buffer only when needed
  readChars(chars, 0, length);
  return new String(chars, 0, length);
}

If you know the length in bytes, you still have to allocate that many chars 
(even though the number of chars may be less than the number of bytes). Not 
a big deal IMHO.
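
A rough sketch of that read path if the prefix were a byte count, not the actual Lucene code. It assumes standard UTF-8 and uses a hypothetical class name (ByteLengthReadSketch). A UTF-8 sequence never decodes to more Java chars than it has bytes, so a buffer of byteLength chars is always large enough, though often larger than needed:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Lucene.
final class ByteLengthReadSketch {
    private char[] chars; // reusable scratch buffer, like the one in readString()

    String decode(byte[] utf8, int offset, int byteLength) {
        // Upper bound: 1-3 byte sequences yield 1 char, 4-byte sequences yield 2 chars.
        if (chars == null || byteLength > chars.length) {
            chars = new char[byteLength];
        }
        CharBuffer out = CharBuffer.wrap(chars);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CoderResult result =
            decoder.decode(ByteBuffer.wrap(utf8, offset, byteLength), out, true);
        if (result.isError()) {
            throw new IllegalArgumentException("malformed UTF-8 input");
        }
        decoder.flush(out);
        // The actual char count is only known after decoding.
        return new String(chars, 0, out.position());
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        String s = new ByteLengthReadSketch().decode(utf8, 0, utf8.length);
        System.out.println(s.length() + " chars from " + utf8.length + " bytes");
    }
}
```

The over-allocation is bounded by the byte length and the buffer is reused across calls, which fits the "not a big deal" assessment.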

A bigger pain is on the writing side, where you can't stream things because 
you don't know what the length is going to be (in either bytes *or* UTF-8 
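
To illustrate the writing-side problem, here is a minimal sketch of a writer with a byte-length prefix. The class and method names are hypothetical, and it uses a fixed 4-byte int prefix from java.io.DataOutputStream where Lucene uses a VInt. The point is only that the whole value has to be encoded (or at least measured) before the first byte of the prefix can go out:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Lucene.
final class LengthPrefixedWriteSketch {

    static void writeString(DataOutputStream out, String s) throws IOException {
        // The prefix comes first on the wire, so the value is buffered in
        // full before anything is written; it cannot be streamed char by char.
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(encoded.length); // assumption: fixed-width prefix, not a VInt
        out.write(encoded);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeString(new DataOutputStream(sink), "a\u20ac");
        System.out.println(sink.size() + " bytes written");
    }
}
```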

So it turns out that Java's 16-bit chars were just a waste... it's still a 
multibyte format *and* it takes up more space. UTF-8 would have been nice - 
no conversions necessary.
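
Both halves of that claim can be checked with a small sketch (hypothetical class and method names; it counts 2 bytes per Java char, as in a char[] buffer). A character outside the Basic Multilingual Plane needs two chars (a surrogate pair), so the 16-bit form is still variable-width, and ASCII text takes twice the space it would in UTF-8:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Lucene.
final class EncodingSizeSketch {

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {
        return s.length() * 2; // each Java char is one 16-bit code unit
    }

    static void report(String label, String s) {
        System.out.println(label
            + ": code points=" + s.codePointCount(0, s.length())
            + ", chars=" + s.length()
            + ", UTF-16 bytes=" + utf16Bytes(s)
            + ", UTF-8 bytes=" + utf8Bytes(s));
    }

    public static void main(String[] args) {
        report("ascii", "hello");
        report("supplementary", "\uD83D\uDE00"); // one code point, two chars
    }
}
```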

-Yonik
