lucene-dev mailing list archives

From: Marvin Humphrey <>
Subject: Re: improve how IndexWriter uses RAM to buffer added documents
Date: Fri, 06 Apr 2007 01:45:11 GMT

On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:

>> What we need to do is cut down on decompression and conflict
>> resolution costs when reading from one segment to another.  KS has
>> solved this problem for stored fields.  Field defs are global and
>> field values are keyed by name rather than field number in the field
>> data file.  Benefits:
>>    * Whole documents can be read from one segment to
>>      another as blobs.
>>    * No flags byte.
>>    * No remapping of field numbers.
>>    * No conflict resolution at all.
>>    * Compressed, uncompressed... doesn't matter.
>>    * Less code.
>>    * The possibility of allowing the user to provide their
>>      own subclass for reading and writing fields. (For
>>      Lucy, in the language of your choice.)
> I hear you, and I really really love those benefits, but, we just
> don't have this freedom with Lucene.

Yeah, too bad.  This is one area where Lucene and Lucy are going to  
differ.  Balmain and I are of one mind about global field defs.
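
To make the blob-copy point concrete, here is a rough Java-flavored sketch
(hypothetical names, not actual KS or Lucene code, and it assumes a simple
length-prefixed doc record):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    /** Hypothetical sketch: with global field defs and values keyed by
     *  field name, a merger can treat each serialized document as an
     *  opaque, length-prefixed blob.  No flags byte, no field-number
     *  remapping, no conflict resolution, compressed or not. */
    class StoredDocBlobCopier {
        static void copyDoc(DataInput in, DataOutput out) throws IOException {
            long docLen = in.readLong();   // assumed length prefix
            out.writeLong(docLen);
            byte[] buf = new byte[8192];
            long remaining = docLen;
            while (remaining > 0) {
                int chunk = (int) Math.min(buf.length, remaining);
                in.readFully(buf, 0, chunk);
                out.write(buf, 0, chunk);
                remaining -= chunk;
            }
        }
    }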

> I think the ability to suddenly birth a new field,

You can do that in KS as of version 0.20_02.  :)

> or change a field's attributes like "has vectors", "stores norms",
> etc., with a new document,

Can't do that, though, and I make no apologies.  I think it's a  

> I suppose if we had a
> single mapping of field names -> numbers in the index, that would gain
> us many of the above benefits?  Hmmm.

You'll still have to be able to remap field numbers when adding  
entire indexes.
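
The remapping I mean looks roughly like this (hypothetical sketch, nothing
like the real Lucene internals): build the merged name -> number table
first, then translate each incoming segment's numbers through it.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Hypothetical sketch of field-number remapping during an
     *  addIndexes-style merge.  Each source index has its own
     *  name -> number assignment, so the merged index assigns fresh
     *  numbers and every incoming field number gets translated through
     *  a per-source table. */
    class FieldNumberRemapper {
        private final Map<String, Integer> merged = new HashMap<String, Integer>();

        /** Returns old-number -> new-number for one source index, given
         *  its field names listed in field-number order. */
        int[] buildRemap(List<String> sourceFieldsInNumberOrder) {
            int[] remap = new int[sourceFieldsInNumberOrder.size()];
            for (int oldNum = 0; oldNum < remap.length; oldNum++) {
                String name = sourceFieldsInNumberOrder.get(oldNum);
                Integer newNum = merged.get(name);
                if (newNum == null) {
                    newNum = Integer.valueOf(merged.size());
                    merged.put(name, newNum);
                }
                remap[oldNum] = newNum.intValue();
            }
            return remap;
        }
    }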

> Here's one idea I just had: assuming there are no deletions, you can
> almost do a raw bytes copy from input segment to output (merged)
> segment of the postings for a given term X.  I think for prox postings
> you can.

You can probably squeeze out some nice gains using a skipVint()  
function, even with deletions.
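
By skipVint() I mean something along these lines (just a sketch; the real
win would be stepping over postings for deleted docs without materializing
the values):

    import java.io.DataInput;
    import java.io.IOException;

    /** Sketch of skipping a VInt-encoded value without decoding it.
     *  VInts store 7 bits per byte and set the high bit on every byte
     *  except the last, so "skip" just means reading bytes until the
     *  high bit is clear. */
    class VIntSkipper {
        static void skipVInt(DataInput in) throws IOException {
            byte b = in.readByte();
            while ((b & 0x80) != 0) {
                b = in.readByte();
            }
        }
    }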

> But for freq postings, you can't, because they are delta coded.

I'm working on this task right now for KS.

KS implements the "Flexible Indexing" paradigm, so all posting data  
goes in a single file.

I've applied an additional constraint to KS:  Every binary file must  
consist of one type of record repeated over and over.  Every indexed  
field gets its own dedicated posting file with the suffix .pNNN to  
allow per-field posting formats.

The I/O code is isolated in subclasses of a new class called  
"Stepper":  You can turn any Stepper loose on its file and read it  
from top to tail.  When the file format changes, Steppers will get  
archived, like old plugins.
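
In Java-flavored pseudocode (KS itself is C and Perl, and these names are
only illustrative), a Stepper boils down to something like:

    import java.io.DataInput;
    import java.io.EOFException;
    import java.io.IOException;

    /** Sketch of the Stepper idea: one record type per binary file, and
     *  any Stepper can be pointed at its file and driven from top to
     *  tail.  When the file format changes, the old subclass is kept
     *  around, like an archived plugin, so old segments stay readable. */
    abstract class Stepper {
        /** Read one record; throws EOFException at end of file. */
        abstract void readRecord(DataInput in) throws IOException;

        /** Drive the Stepper over an entire file, top to tail. */
        void readAll(DataInput in) throws IOException {
            try {
                while (true) {
                    readRecord(in);
                }
            } catch (EOFException done) {
                // every record has been consumed
            }
        }
    }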

My present task is to write the code for the Stepper subclasses  
MatchPosting, ScorePosting, and RichPosting. (PayloadPosting can  
wait.)  As I write them, I will see if I can figure out a format that
can be merged as speedily as possible.  Perhaps the precise variant  
of delta encoding used in Lucene's .frq file should be avoided.
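
For reference, the delta variant I mean is the one that packs the freq
flag into the doc delta's low bit.  As I understand the current Lucene
format, it amounts to roughly this:

    import java.io.DataOutput;
    import java.io.IOException;

    /** Sketch of a .frq-style posting encoding, as I understand it: the
     *  doc delta is shifted left one bit, the low bit flags freq == 1,
     *  and larger freqs follow as a separate VInt.  Because the deltas
     *  chain across the whole segment, the raw bytes can't simply be
     *  appended after another segment's postings for the same term. */
    class FreqPostingWriter {
        private int lastDoc = 0;

        void writePosting(DataOutput out, int doc, int freq) throws IOException {
            int delta = doc - lastDoc;
            lastDoc = doc;
            if (freq == 1) {
                writeVInt(out, (delta << 1) | 1);
            } else {
                writeVInt(out, delta << 1);
                writeVInt(out, freq);
            }
        }

        static void writeVInt(DataOutput out, int value) throws IOException {
            while ((value & ~0x7F) != 0) {
                out.writeByte((value & 0x7F) | 0x80);
                value >>>= 7;
            }
            out.writeByte(value);
        }
    }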

> Except: it's only the first entry of the incoming segment's freq
> postings that needs to be re-interpreted?  So you could read that one,
> encode the delta based on "last docID" for previous segment (I think
> we'd have to store this in the index, probably only if termFreq >
> threshold), and then copyBytes the rest of the posting?  I will try
> this out on the merges I'm doing in LUCENE-843; I think it should
> work and make merging faster (assuming no deletes)?

Ugh, more special case code.
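
If I follow, that special case amounts to something like the sketch below
(hypothetical; it assumes no deletions and that the caller already knows
the byte length of the term's postings in the incoming segment):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    /** Sketch of the proposed shortcut: when appending one segment's freq
     *  postings for a term after another's, only the first doc code is
     *  wrong, since its delta is relative to the incoming segment rather
     *  than to the last docID already written.  Re-encode that one entry,
     *  then copy the remaining bytes verbatim.  Assumes no deletions. */
    class FreqPostingAppender {
        static void appendPostings(DataInput in, DataOutput out, long numBytes,
                                   int docBase, int lastDocWritten)
                throws IOException {
            // Decode just the first doc code and rebase its delta.
            long[] consumed = new long[] { 0 };
            int code = readVInt(in, consumed);
            int firstDoc = docBase + (code >>> 1);
            writeVInt(out, ((firstDoc - lastDocWritten) << 1) | (code & 1));
            // Everything after the first doc code is unchanged: copy raw bytes.
            byte[] buf = new byte[8192];
            long remaining = numBytes - consumed[0];
            while (remaining > 0) {
                int chunk = (int) Math.min(buf.length, remaining);
                in.readFully(buf, 0, chunk);
                out.write(buf, 0, chunk);
                remaining -= chunk;
            }
        }

        static int readVInt(DataInput in, long[] consumed) throws IOException {
            byte b = in.readByte();
            consumed[0]++;
            int value = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = in.readByte();
                consumed[0]++;
                value |= (b & 0x7F) << shift;
            }
            return value;
        }

        static void writeVInt(DataOutput out, int value) throws IOException {
            while ((value & ~0x7F) != 0) {
                out.writeByte((value & 0x7F) | 0x80);
                value >>>= 7;
            }
            out.writeByte(value);
        }
    }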

I have to say, I started trying to go over your patch, and the  
overwhelming impression I got coming back to this part of the Lucene  
code base in earnest for the first time since using 1.4.3 as a  
porting reference was: simplicity seems to be nobody's priority these days.

Marvin Humphrey
Rectangular Research
