lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Re: batch indexing
Date Fri, 09 Aug 2002 13:47:14 GMT
Doug Cutting wrote:

> [ I've moved this discussion to lucene-dev.  -drc ]
>
> Dmitry Serebrennikov wrote:
>
>> I was just thinking about doing something similar, but after looking 
>> at your code I thought couldn't the same thing be done by 
>> manipulating the mergeFactor on the existing IndexWriter? It already 
>> indexes n documents into memory before writing a new disk segment. I 
>> just looked at it again but I can't see without a detailed study 
>> whether the mergeFactor applies to merging from RAM to disk only or 
>> for merging on-disk segments as well. If it applies to both, perhaps 
>> we could add a different field to the IndexWriter to allow the two 
>> values to be different? Am I missing something?
>
>
> There'd be more to it than manipulating mergeFactor, but you're right, 
> IndexWriter already buffers indexes in a RAMDirectory.  And it would 
> be nice to use that RAMDirectory more extensively, controlled 
> separately from mergeFactor.  Peter achieves this by using two 
> IndexWriters, but probably this should be built-in to IndexWriter.
>
> Also, it would be better to limit things not based on the number of 
> documents indexed in RAM, but on the amount of memory used. 
> RAMDirectory could be altered to keep track of how big it is.  We 
> could, by default, allow buffering of up to, say, 10MB of index before 
> things are flushed.  This would speed up and reduce the number of 
> files when batch indexing.
>
> The changes to RAMDirectory should be straightforward.  New buffers 
> are only allocated in the flushBuffer implementation, and they're only 
> freed by deleteFile and renameFile.
>
> The changes to IndexWriter are more complicated.  The simplest 
> approach would be to just buffer all single-document indexes in memory 
> and merge them all at once when flushing.  However, merging thousands 
> of documents at once would use a lot of memory, and is probably not 
> acceptable.  So there probably needs to be two stacks of segment 
> indexes, one for those in memory and one for those on disk.  (The 
> latter is what is contained in the "segments" file.)
>
> Both could use an algorithm like that used by the current stack.  Each 
> time a new document is added, its index is pushed onto a stack, and 
> merging is attempted.  When segments are merged, the old indexes are 
> removed from the top of the stack, and the new index replaces them on 
> the top of the stack.
>
> The current algorithm to figure out when to merge is roughly:
>
>    int n = 1;
>    while (stack[top] through stack[top-mergeFactor]
>           all contain n or fewer documents) {
>      merge(stack[top] through stack[top-mergeFactor]);
>      n *= mergeFactor;
>    }
>
> The new case that would occur with two stacks is when the in-memory 
> index is flushed to disk as a single segment.  This could put a 
> segment on top of the disk-based stack that's larger than those 
> beneath it, which can't currently happen, and breaks the above 
> algorithm.  I think the fix is simple though:
>
>    int n = stack[top].numDocs;
>    while (stack[top] through stack[top-mergeFactor]
>           all contain n or fewer documents) {
>      merge(stack[top] through stack[top-mergeFactor]);
>      n *= mergeFactor;
>    }
>
> Does that look right?  The goals are to:
>   - keep only O(log(N)) segments, so searching is fast
>   - only merge each document O(log(N)) times, so indexing is fast
>
> Dmitry, are you interested in working on this? 

Well, frankly, I didn't realize how complicated this was... Yes, I am 
interested in working on this and it would be a great addition to 
Lucene, but there are other things in my work queue way ahead of this 
one. Peter's solution suddenly looks much more appealing :)

Dmitry.

>
>
> Doug
>
>
>
> -- 
> To unsubscribe, e-mail:   
> <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: 
> <mailto:lucene-dev-help@jakarta.apache.org>
>
>




--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>


Mime
View raw message