lucene-dev mailing list archives

From robert engels <reng...@ix.netcom.com>
Subject Re: possible segment merge improvement?
Date Thu, 01 Nov 2007 06:30:35 GMT
Actually, slightly better signatures would use method overloading:

int FieldsReader.length(int doc); // length of document in bytes
int FieldsReader.doc(int doc, byte[] buffer); // read a formatted document into a buffer

void FieldsWriter.addDocument(byte[] buffer, int len); // write an already formatted document from a buffer
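
A minimal sketch of how the proposed calls could fit together in a buffer-reusing copy loop. Everything here is hypothetical: RawDocReader, RawDocWriter, the in-memory implementations, and the maxDoc()/isDeleted() helpers are stand-ins invented so the example runs on its own, not existing Lucene classes or the final API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types are Lucene classes.
class RawMergeSketch {

    /** Stand-in for the proposed FieldsReader additions. */
    interface RawDocReader {
        int maxDoc();
        boolean isDeleted(int doc);
        int length(int doc);             // length of the stored document in bytes
        int doc(int doc, byte[] buffer); // copy the raw bytes into buffer, return the length
    }

    /** Stand-in for the proposed FieldsWriter addition. */
    interface RawDocWriter {
        void addDocument(byte[] buffer, int len); // write an already formatted document
    }

    /** In-memory reader, present only to make the sketch runnable. */
    static final class MemoryReader implements RawDocReader {
        private final byte[][] docs;
        private final boolean[] deleted;

        MemoryReader(byte[][] docs, boolean[] deleted) {
            this.docs = docs;
            this.deleted = deleted;
        }

        public int maxDoc() { return docs.length; }
        public boolean isDeleted(int doc) { return deleted[doc]; }
        public int length(int doc) { return docs[doc].length; }

        public int doc(int doc, byte[] buffer) {
            System.arraycopy(docs[doc], 0, buffer, 0, docs[doc].length);
            return docs[doc].length;
        }
    }

    /** In-memory writer that keeps a private copy of each document. */
    static final class MemoryWriter implements RawDocWriter {
        final List<byte[]> docs = new ArrayList<>();

        public void addDocument(byte[] buffer, int len) {
            docs.add(Arrays.copyOf(buffer, len));
        }
    }

    /** Copies every live document, reusing one buffer that grows only when needed. */
    static int copyLiveDocs(List<? extends RawDocReader> readers, RawDocWriter writer) {
        byte[] buffer = new byte[1024];
        int copied = 0;
        for (RawDocReader reader : readers) {
            for (int doc = 0; doc < reader.maxDoc(); doc++) {
                if (reader.isDeleted(doc)) {
                    continue; // deleted documents are dropped by the merge
                }
                int len = reader.length(doc);
                if (len > buffer.length) {
                    buffer = new byte[len * 2]; // leave headroom for later documents
                }
                reader.doc(doc, buffer);
                writer.addDocument(buffer, len);
                copied++;
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        MemoryReader first = new MemoryReader(
                new byte[][] { {1, 2, 3}, {4, 5} }, new boolean[] { false, true });
        MemoryReader second = new MemoryReader(
                new byte[][] { new byte[3000] }, new boolean[] { false });
        MemoryWriter writer = new MemoryWriter();
        int copied = copyLiveDocs(Arrays.asList(first, second), writer);
        System.out.println("copied " + copied + " documents");
    }
}
```

The writer has to copy or flush the bytes before returning, because the caller overwrites the same buffer on the next iteration.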


On Nov 1, 2007, at 1:06 AM, robert engels wrote:

> It seems that the following are needed:
>
> FieldInfos.hashCode(); // to allow for fast equals failure
> FieldInfos.equals();
>
> for most efficient buffer reuse during merge to avoid GC, add
>
> int FieldsReader.doclength(int doc);
> int FieldsReader.binarydoc(int doc,byte[] buffer);
>
> this will allow the caller to reuse the existing buffer if large  
> enough, or create a new one
>
> and
>
> FieldsWriter.addBinaryDocument(byte[] buffer,int len);
>
> All of the above methods are trivial.
>
> SegmentMerger just needs to be changed to compare the readers to be  
> merged, and if all have equal FieldInfos, then use a short circuit  
> copy similar to
>
> byte[] buffer = new byte[1024];
>
> for each reader {
>     for doc in reader {
>         if doc not deleted {
>             int len = reader.doclength(doc);
>             if (len > buffer.length) {
>                 buffer = new byte[len*2]; // allow for growth
>             }
>             reader.binarydoc(doc, buffer);
>             newsegment.addBinaryDocument(buffer, len);
>         }
>     }
> }
>
>
>
> On Nov 1, 2007, at 12:30 AM, jian chen wrote:
>
>> Hi, Robert,
>>
>> That's a brilliant idea! Thanks so much for suggesting that.
>>
>> Cheers,
>>
>> Jian
>>
>> On 10/31/07, robert engels <rengels@ix.netcom.com> wrote:
>>>
>>> Currently, when merging segments, every document is parsed and then
>>> rewritten since the field numbers may differ between the segments
>>> (compressed data is not uncompressed in the latest versions).
>>>
>>> It would seem that in many (if not most) Lucene uses the fields
>>> stored within each document within an index are relatively static,
>>> probably changing for all documents added after point X, if at all.
>>>
>>> Why not check the fields dictionary for the segments being merged,
>>> and if the same, just copy the binary data directly?
>>>
>>> In the common case this should be a vast improvement.
>>>
>>> Anyone worked on anything like this? Am I missing something?
>>>
>>> Robert Engels
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>>
>
>
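
The FieldInfos.equals()/hashCode() pair proposed in the quoted message could look roughly like the sketch below. It is only an illustration under stated assumptions: FieldInfosSketch is a hypothetical stand-in that models nothing but the field-number-to-name table and treats it as immutable. A real FieldInfos also carries per-field flags that would have to take part in the comparison, and it can change as fields are added, so a cached hash would need invalidating.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Lucene's FieldInfos; only the number -> name table is modelled.
final class FieldInfosSketch {
    private final String[] namesByNumber; // array index is the field number
    private final int hash;               // computed once, since this sketch is immutable

    FieldInfosSketch(String... namesByNumber) {
        this.namesByNumber = namesByNumber.clone();
        this.hash = Arrays.hashCode(this.namesByNumber);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof FieldInfosSketch)) {
            return false;
        }
        FieldInfosSketch that = (FieldInfosSketch) other;
        // Different hashes prove the tables differ, so the full comparison is skipped.
        if (hash != that.hash) {
            return false;
        }
        return Arrays.equals(namesByNumber, that.namesByNumber);
    }

    /** True when every segment uses the same field numbering, so raw copying is safe. */
    static boolean allEqual(List<FieldInfosSketch> segments) {
        for (int i = 1; i < segments.size(); i++) {
            if (!segments.get(0).equals(segments.get(i))) {
                return false;
            }
        }
        return true;
    }
}
```

A merger would call something like allEqual() once per merge and fall back to the existing parse-and-rewrite path whenever it returns false.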

