lucene-dev mailing list archives

From robert engels <>
Subject Re: possible segment merge improvement?
Date Thu, 01 Nov 2007 06:06:07 GMT
It seems that the following are needed:

FieldInfos.hashCode(); // to allow for fast equals failure
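
To illustrate the fast-failure idea, here is a minimal sketch in Java. FieldInfosSketch is a hypothetical stand-in, not Lucene's actual FieldInfos class, and it models only field names and an "indexed" flag. The hash is cached at construction, so equals() can reject most mismatches with a single int comparison before walking the arrays.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Lucene's FieldInfos (assumption: only field
// names and an "indexed" flag are modeled; the real class holds more).
final class FieldInfosSketch {
    private final String[] names;    // the name at index i has field number i
    private final boolean[] indexed; // whether field i is indexed
    private final int hash;          // cached, since instances are immutable

    FieldInfosSketch(String[] names, boolean[] indexed) {
        if (names.length != indexed.length) {
            throw new IllegalArgumentException("names and flags must have the same length");
        }
        this.names = names.clone();
        this.indexed = indexed.clone();
        this.hash = 31 * Arrays.hashCode(this.names) + Arrays.hashCode(this.indexed);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FieldInfosSketch)) return false;
        FieldInfosSketch other = (FieldInfosSketch) o;
        // Differing hashes prove inequality without comparing the arrays.
        if (hash != other.hash) return false;
        return Arrays.equals(names, other.names)
            && Arrays.equals(indexed, other.indexed);
    }

    // True when every entry equals the first one, i.e. all segments
    // share the same field numbering.
    static boolean allEqual(List<FieldInfosSketch> infos) {
        for (int i = 1; i < infos.size(); i++) {
            if (!infos.get(0).equals(infos.get(i))) return false;
        }
        return true;
    }
}
```

Field order matters in this sketch because the position doubles as the field number; two segments with the same names in a different order compare as unequal.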

For the most efficient buffer reuse during a merge (to avoid GC), add

int FieldsReader.doclength(int doc);
int FieldsReader.binarydoc(int doc,byte[] buffer);

This will allow the caller to reuse the existing buffer if it is large
enough, or to create a new one otherwise.


FieldsWriter.addBinaryDocument(byte[] buffer,int len);

All of the above methods are trivial.

SegmentMerger just needs to be changed to compare the readers to be
merged and, if all have equal FieldInfos, use a short-circuit
copy similar to

byte[] buffer = new byte[1024];

for each reader {
    for doc in reader {
        if doc not deleted {
            int len = reader.doclength(doc);
            if (len > buffer.length) {
                buffer = new byte[len*2]; // allow for growth
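
The loop above breaks off in this archived copy. A minimal Java sketch of how such a short-circuit copy might be completed is given here. The RawDocReader and RawDocWriter interfaces are hypothetical and only mirror the methods proposed in this message; they are not existing Lucene APIs. The sketch also assumes that binarydoc returns the number of bytes it copied, and ArrayRawDocReader is an in-memory stand-in included only so the sketch can be exercised.

```java
import java.util.List;

// Hypothetical interfaces mirroring the proposed methods (not Lucene APIs).
interface RawDocReader {
    int maxDoc();
    boolean isDeleted(int doc);
    int doclength(int doc);                // stored size of the document in bytes
    int binarydoc(int doc, byte[] buffer); // assumed to return the bytes copied
}

interface RawDocWriter {
    void addBinaryDocument(byte[] buffer, int len);
}

// In-memory reader, included only to make the sketch testable.
final class ArrayRawDocReader implements RawDocReader {
    private final byte[][] docs;
    private final boolean[] deleted;

    ArrayRawDocReader(byte[][] docs, boolean[] deleted) {
        this.docs = docs;
        this.deleted = deleted;
    }

    public int maxDoc() { return docs.length; }
    public boolean isDeleted(int doc) { return deleted[doc]; }
    public int doclength(int doc) { return docs[doc].length; }

    public int binarydoc(int doc, byte[] buffer) {
        System.arraycopy(docs[doc], 0, buffer, 0, docs[doc].length);
        return docs[doc].length;
    }
}

final class RawMergeSketch {
    // Copies every live document as raw bytes and returns how many were copied.
    // Only valid when all readers share the same field numbering.
    static int copyAll(List<? extends RawDocReader> readers, RawDocWriter writer) {
        byte[] buffer = new byte[1024];
        int copied = 0;
        for (RawDocReader reader : readers) {
            for (int doc = 0; doc < reader.maxDoc(); doc++) {
                if (reader.isDeleted(doc)) continue;
                int len = reader.doclength(doc);
                if (len > buffer.length) {
                    buffer = new byte[len * 2]; // allow for growth
                }
                int read = reader.binarydoc(doc, buffer);
                writer.addBinaryDocument(buffer, read);
                copied++;
            }
        }
        return copied;
    }
}
```

The single buffer is shared across all readers and is only reallocated when a document does not fit, so a merge of many small documents allocates once.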

On Nov 1, 2007, at 12:30 AM, jian chen wrote:

> Hi, Robert,
> That's a brilliant idea! Thanks so much for suggesting that.
> Cheers,
> Jian
> On 10/31/07, robert engels <> wrote:
>> Currently, when merging segments, every document is parsed and then
>> rewritten since the field numbers may differ between the segments
>> (compressed data is not uncompressed in the latest versions).
>> It would seem that in many (if not most) Lucene uses the fields
>> stored within each document in an index are relatively static,
>> probably changing for all documents added after point X, if at all.
>> Why not check the fields dictionary for the segments being merged,
>> and if the same, just copy the binary data directly?
>> In the common case this should be a vast improvement.
>> Anyone worked on anything like this? Am I missing something?
>> Robert Engels
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
