lucene-dev mailing list archives

From Michael McCandless
Subject Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
Date Wed, 26 Mar 2008 21:22:49 GMT

Marvin Humphrey wrote:
>> Michael McCandless resolved LUCENE-510.
> Congratulations.  :)

Thanks.  I didn't quite realize what I was getting myself into when I  
said "yes" on that issue!

> When I wrote my initial patch, I saw a performance degradation of  
> c. 30% in my indexing benchmarks.

I think it was 20%.

> Repeated reallocation was presumably one culprit: when length in  
> Java chars is stored in the index, you only need to allocate once,  
> whereas when reading in UTF-8, you can't know just how much memory  
> you need until the read completes.  Furthermore, at write-time, you  
> can't look at something composed of 16-bit chars and know what the  
> byte-length of its UTF-8 representation will be without pre-scanning.

Right, not doing allocations was pretty much it (the getBytes method  
of String was most of the slowdown I think).  I was also able to  
eliminate another per-term scan we were doing in DocumentsWriter and  
fold it into the conversion.

I ended up creating custom conversion methods (UTF8toUTF16 and vice
versa) that convert into a reused byte[] or char[], which grows as
needed; then I just bulk-write the bytes.  I think this is not much
slower than before (modified UTF-8), since that code also had to go
character by character with ifs inside the inner loop.
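
For readers of the archive, here is a minimal sketch of that kind of
conversion.  It is not the code from the patch: the class and method
names are hypothetical, and replacing an unpaired surrogate with U+FFFD
is an assumption.  It shows how sizing the reused buffer for the worst
case (3 bytes per UTF-16 code unit) avoids both a pre-scan and a
per-term allocation.

```java
import java.util.Arrays;

/** Hypothetical sketch; not Lucene's actual conversion code. */
public class Utf8Buffer {
    private byte[] bytes = new byte[16]; // reused across calls, grown on demand
    private int length;                  // number of valid bytes after encode()

    /** Encodes src[offset, offset+len) as UTF-8 into the reused buffer. */
    public void encode(char[] src, int offset, int len) {
        // Worst case is 3 bytes per UTF-16 code unit (a surrogate pair
        // needs 4 bytes for 2 units), so one check replaces a pre-scan.
        int maxBytes = 3 * len;
        if (bytes.length < maxBytes) {
            bytes = new byte[Math.max(maxBytes, 2 * bytes.length)];
        }
        int out = 0;
        int end = offset + len;
        for (int i = offset; i < end; i++) {
            char ch = src[i];
            if (ch < 0x80) {
                bytes[out++] = (byte) ch;
            } else if (ch < 0x800) {
                bytes[out++] = (byte) (0xC0 | (ch >> 6));
                bytes[out++] = (byte) (0x80 | (ch & 0x3F));
            } else if (Character.isHighSurrogate(ch) && i + 1 < end
                    && Character.isLowSurrogate(src[i + 1])) {
                // Valid surrogate pair: one 4-byte sequence.
                int cp = Character.toCodePoint(ch, src[++i]);
                bytes[out++] = (byte) (0xF0 | (cp >> 18));
                bytes[out++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                bytes[out++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                bytes[out++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                // Assumption: an unpaired surrogate becomes U+FFFD.
                int c = Character.isSurrogate(ch) ? 0xFFFD : ch;
                bytes[out++] = (byte) (0xE0 | (c >> 12));
                bytes[out++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                bytes[out++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        length = out;
    }

    public int length() { return length; }

    /** Copy of the valid bytes; a real writer would bulk-write bytes[0..length). */
    public byte[] toArray() { return Arrays.copyOf(bytes, length); }
}
```

A caller would keep one Utf8Buffer per writer, call encode() for each
term, and write the first length() bytes in a single bulk call.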

I'm less happy with the 11% slowdown on TermEnum, and that's even
with the optimization to incrementally decode only the "new" UTF-8
bytes as we are reading the changed suffix of each term, reusing the
already-decoded UTF-16 chars from the previous term.  This will
slow down populating a FieldCache, which is already slow.  But
LUCENE-831 and LUCENE-1231 should fix that.
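
The incremental decoding described above can be sketched roughly as
follows.  This is an illustration only, not the TermEnum code: the
class name, the next(prefixBytes, suffix) signature, and the charStart
bookkeeping are all hypothetical, and the input is assumed to be
well-formed UTF-8 with the shared prefix measured in bytes.

```java
import java.util.Arrays;

/** Hypothetical sketch of suffix-only UTF-8 decoding between terms. */
public class IncrementalTermDecoder {
    private byte[] utf8 = new byte[16];     // bytes of the current term
    private char[] utf16 = new char[16];    // decoded chars of the current term
    // charStart[b] = index in utf16 of the character starting at byte b,
    // or -1 if byte b is a continuation byte; charStart[byteLen] = charLen.
    private int[] charStart = new int[17];
    private int byteLen, charLen;

    /** Advances to a term sharing prefixBytes bytes with the previous one. */
    public void next(int prefixBytes, byte[] suffix) {
        int newLen = prefixBytes + suffix.length;
        if (utf8.length < newLen) {
            int cap = Math.max(newLen, 2 * utf8.length);
            utf8 = Arrays.copyOf(utf8, cap);
            utf16 = Arrays.copyOf(utf16, cap);  // never more chars than bytes
            charStart = Arrays.copyOf(charStart, cap + 1);
        }
        System.arraycopy(suffix, 0, utf8, prefixBytes, suffix.length);

        // A byte-level prefix may end inside a multi-byte character;
        // back up to the start of that character before decoding.
        int b = prefixBytes;
        while (charStart[b] == -1) {
            b--;
        }
        int c = charStart[b]; // chars before this index are reused as-is

        while (b < newLen) {
            int lead = utf8[b] & 0xFF;
            int cp, n;
            if (lead < 0x80)      { cp = lead;        n = 1; }
            else if (lead < 0xE0) { cp = lead & 0x1F; n = 2; }
            else if (lead < 0xF0) { cp = lead & 0x0F; n = 3; }
            else                  { cp = lead & 0x07; n = 4; }
            charStart[b] = c;
            for (int k = 1; k < n; k++) {
                charStart[b + k] = -1;
                cp = (cp << 6) | (utf8[b + k] & 0x3F);
            }
            b += n;
            c += Character.toChars(cp, utf16, c); // writes 1 or 2 chars
        }
        charStart[newLen] = c;
        byteLen = newLen;
        charLen = c;
    }

    public String term() { return new String(utf16, 0, charLen); }
}
```

Only the bytes from the last shared character boundary onward are
decoded for each term; the chars before that point are left untouched
in the reused char[].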

> Are the string diffs and comparisons now performed against raw  
> bytes, so that fewer conversions are needed?

Alas, not yet: Lucene still uses UTF-16 Java chars internally.  The
conversion to UTF-8 happens "at the last minute" when writing, and
"immediately" when reading.

I started exploring keeping UTF-8 bytes further in, but it quickly
got messy because it would require changing how the term infos are
sorted to be Unicode code point order.  Comparing UTF-8 bytes (as
unsigned values) gives the same order as comparing Unicode code
points, which is nice.  But comparing UTF-16 values is almost but not
quite the same: surrogate pairs sort before the chars U+E000 to
U+FFFF, even though their code points are higher.  So suddenly
everywhere a string comparison takes place I had to assess whether
that comparison should be by Unicode code point, and call our own
method for doing so.  It quickly became a "big" project so I ran back
to sorting by UTF-16 value.
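
A small sketch of the ordering mismatch, for illustration.  The class
and method names are hypothetical and are not Lucene APIs.  It compares
one BMP character near the top of the range against one supplementary
character in three ways: by UTF-16 code unit (String.compareTo), by
code point, and by unsigned UTF-8 byte.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; shows where UTF-16 order and code point order differ. */
public class CodePointOrder {

    /** Compares two strings by Unicode code point rather than by char. */
    static int compareCodePoints(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(j);
            if (cpA != cpB) {
                return cpA < cpB ? -1 : 1;
            }
            i += Character.charCount(cpA);
            j += Character.charCount(cpB);
        }
        // One string is a prefix of the other (or they are equal).
        return (a.length() - i) - (b.length() - j);
    }

    /** Compares two byte arrays lexicographically, treating bytes as unsigned. */
    static int compareUnsignedBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";         // U+FF5E, a single UTF-16 code unit
        String supp = "\uD800\uDC00";  // U+10000, a surrogate pair

        System.out.println("UTF-16 order:     " + Integer.signum(bmp.compareTo(supp)));
        System.out.println("code point order: " + Integer.signum(compareCodePoints(bmp, supp)));
        System.out.println("UTF-8 byte order: " + Integer.signum(compareUnsignedBytes(
                bmp.getBytes(StandardCharsets.UTF_8),
                supp.getBytes(StandardCharsets.UTF_8))));
    }
}
```

String.compareTo puts U+FF5E after U+10000 because 0xFF5E is greater
than the high surrogate 0xD800, while the code point and UTF-8 byte
comparisons both put it before.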

