lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
Date Wed, 26 Mar 2008 20:56:26 GMT

> Michael McCandless resolved LUCENE-510.

Congratulations.  :)

When I wrote my initial patch, I saw a performance degradation of
about 30% in my indexing benchmarks.  Repeated reallocation was
presumably one culprit: when length in Java chars is stored in the
index, you only need to allocate once, whereas when reading in UTF-8,
you can't know just how much memory you need until the read completes.
Furthermore, at write-time, you can't look at something composed of
16-bit chars and know what the byte-length of its UTF-8 representation
will be without pre-scanning.
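
To make the write-time cost concrete, here is a minimal sketch of the
kind of pre-scan I mean.  It is not code from my patch or from Lucene;
the class and method names are made up for illustration, and it assumes
standard UTF-8 rather than the modified UTF-8 variant.  One pass over
the chars counts the bytes, and only then can the length prefix be
written.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of Lucene.
public class Utf8LengthSketch {

    /** Counts the bytes needed to encode s as standard UTF-8, without encoding it. */
    public static int utf8Length(CharSequence s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                      // ASCII
            } else if (c < 0x800) {
                bytes += 2;                      // U+0080..U+07FF
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                      // surrogate pair: one supplementary code point
                i++;                             // skip the low surrogate
            } else {
                // Rest of the BMP.  An unpaired surrogate is also counted
                // as 3 bytes here, which is an assumption of this sketch.
                bytes += 3;
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String term = "na\u00efve \u20ac";       // well-formed sample input
        int scanned = utf8Length(term);
        int encoded = term.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(term.length() + " chars, " + scanned
                + " bytes by pre-scan, " + encoded + " bytes by encoding");
    }
}
```

The scan touches every char once before any byte is written, which is
the extra work a chars-length prefix avoids.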

How did you solve those problems?  Are the string diffs and  
comparisons now performed against raw bytes, so that fewer conversions  
are needed?

Marvin Humphrey
Rectangular Research
