lucene-dev mailing list archives

From Doug Cutting <cutt...@lucene.com>
Subject Re: batch indexing
Date Thu, 08 Aug 2002 22:18:03 GMT
[ I've moved this discussion to lucene-dev.  -drc ]

Dmitry Serebrennikov wrote:
> I was just thinking about doing something similar, but after looking at 
> your code I wondered: couldn't the same thing be done by manipulating the 
> mergeFactor on the existing IndexWriter? It already indexes n documents 
> into memory before writing a new disk segment. I just looked at it again, 
> but I can't tell without a detailed study whether the mergeFactor applies 
> only to merging from RAM to disk or to merging on-disk segments as 
> well. If it applies to both, perhaps we could add a separate field to 
> the IndexWriter to allow the two values to differ? Am I missing 
> something?

There'd be more to it than manipulating mergeFactor, but you're right, 
IndexWriter already buffers indexes in a RAMDirectory.  And it would be 
nice to use that RAMDirectory more extensively, controlled separately 
from mergeFactor.  Peter achieves this by using two IndexWriters, but 
this should probably be built into IndexWriter.
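
For comparison, the two-writer approach looks roughly like this (a 
sketch, not Peter's actual code; the class name and the fixed 
document-count threshold are illustrative):

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    /** Buffers documents in a RAM index and periodically merges it
     *  into the disk index with addIndexes(). */
    public class TwoWriterIndexer {
      private static final int FLUSH_EVERY = 1000; // illustrative threshold

      private final Analyzer analyzer;
      private final IndexWriter diskWriter;
      private RAMDirectory ramDir;
      private IndexWriter ramWriter;
      private int buffered = 0;

      public TwoWriterIndexer(Directory diskDir, Analyzer a)
          throws IOException {
        analyzer = a;
        diskWriter = new IndexWriter(diskDir, a, true);
        startRamWriter();
      }

      private void startRamWriter() throws IOException {
        ramDir = new RAMDirectory();
        ramWriter = new IndexWriter(ramDir, analyzer, true);
      }

      public void addDocument(Document doc) throws IOException {
        ramWriter.addDocument(doc);
        if (++buffered >= FLUSH_EVERY)
          flush();
      }

      public void flush() throws IOException {
        ramWriter.close();                               // seal the RAM index
        diskWriter.addIndexes(new Directory[] {ramDir}); // merge it to disk
        startRamWriter();                                // start a fresh buffer
        buffered = 0;
      }

      public void close() throws IOException {
        flush();
        diskWriter.close();
      }
    }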

Also, it would be better to limit things not based on the number of 
documents indexed in RAM, but on the amount of memory used. 
RAMDirectory could be altered to keep track of how big it is.  We could, 
by default, allow buffering of up to, say, 10MB of index before things 
are flushed.  This would both speed up batch indexing and reduce the 
number of files it creates.

The changes to RAMDirectory should be straightforward.  New buffers are 
only allocated in the flushBuffer implementation, and they're only freed 
by deleteFile and renameFile.
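
In other words, the accounting reduces to a single counter touched at 
those three points (a sketch; the field and accessor are hypothetical 
names for what RAMDirectory would gain):

    // Inside RAMDirectory (hypothetical additions):
    private long sizeInBytes = 0;        // running total of buffer bytes

    // in flushBuffer(), after allocating a new buffer of length len:
    sizeInBytes += len;

    // in deleteFile() and renameFile(), when an old file's buffers
    // are released:
    sizeInBytes -= file.length;

    /** Bytes currently buffered in RAM. */
    public long sizeInBytes() { return sizeInBytes; }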

The changes to IndexWriter are more complicated.  The simplest approach 
would be to just buffer all single-document indexes in memory and merge 
them all at once when flushing.  However, merging thousands of documents 
at once would use a lot of memory, and is probably not acceptable.  So 
there probably needs to be two stacks of segment indexes, one for those 
in memory and one for those on disk.  (The latter is what is contained 
in the "segments" file.)

Both could use an algorithm like that used by the current stack.  Each 
time a new document is added, its index is pushed onto a stack, and 
merging is attempted.  When segments are merged, the old indexes are 
popped from the top of the stack and the merged index replaces them.

The current algorithm to figure out when to merge is roughly:

    int n = 1;
    while (stack[top] through stack[top-mergeFactor+1]
           all contain n or fewer documents) {
      merge(stack[top] through stack[top-mergeFactor+1]);
      n *= mergeFactor;
    }

The new case that would occur with two stacks is when the in-memory 
index is flushed to disk as a single segment.  This could put a segment 
on top of the disk-based stack that's larger than those beneath it, 
which can't currently happen, and breaks the above algorithm.  I think 
the fix is simple though:

    int n = stack[top].numDocs;
    while (stack[top] through stack[top-mergeFactor+1]
           all contain n or fewer documents) {
      merge(stack[top] through stack[top-mergeFactor+1]);
      n *= mergeFactor;
    }
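
To make the behavior concrete, here's a standalone toy version of that 
check, operating on a list of segment document counts rather than real 
segments (plain Java, not IndexWriter code; seeding n from the top 
segment is the only change from the current algorithm):

    import java.util.ArrayList;
    import java.util.List;

    /** Toy model of the revised merge policy.  The bottom of the
     *  stack is index 0; values are segment document counts. */
    public class MergeSketch {

      static void maybeMerge(List<Integer> stack, int mergeFactor) {
        int n = stack.get(stack.size() - 1);     // was "int n = 1"
        while (stack.size() >= mergeFactor
               && topAllAtMost(stack, mergeFactor, n)) {
          int merged = 0;                        // pop the top mergeFactor
          for (int i = 0; i < mergeFactor; i++)  // segments and merge them
            merged += stack.remove(stack.size() - 1);
          stack.add(merged);                     // push the merged segment
          n *= mergeFactor;
        }
      }

      static boolean topAllAtMost(List<Integer> stack, int k, int n) {
        for (int i = 1; i <= k; i++)
          if (stack.get(stack.size() - i) > n) return false;
        return true;
      }

      public static void main(String[] args) {
        List<Integer> disk = new ArrayList<Integer>();
        for (int i = 0; i < 8; i++) {            // eight 1-doc segments
          disk.add(1);
          maybeMerge(disk, 3);                   // mergeFactor of 3
        }
        System.out.println(disk);                // [3, 3, 1, 1]
        disk.add(5);                             // flushed RAM segment,
        maybeMerge(disk, 3);                     // larger than those below
        System.out.println(disk);                // [13]
      }
    }

With the old seeding of n = 1, the five-document segment at the end 
would never satisfy the merge condition and would sit on top of smaller 
segments indefinitely; seeding from the top segment folds it in right 
away.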

Does that look right?  The goals are to:
   - keep only O(log(N)) segments, so searching is fast
   - only merge each document O(log(N)) times, so indexing is fast
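
Roughly, with a mergeFactor of 10 and a million documents, each size 
level (1, 10, 100, ... 100,000 documents) holds at most nine segments 
before they merge, so a search sees on the order of 9 * log10(1,000,000) 
= 54 segments, and each document is copied through at most six merges.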

Dmitry, are you interested in working on this?

Doug




