lucene-dev mailing list archives

From "Steven Tamm" <st...@salesforce.com>
Subject Optimization for IndexWriter.addIndexes()
Date Wed, 15 Mar 2006 21:39:24 GMT
One big performance problem with IndexWriter.addIndexes() is that it has
to optimize the index both before and after adding the segments.  When
you have a very large index to which you are adding batches of small
updates, these calls to optimize make addIndexes() impractical to use,
and they make parallel updates very frustrating.

Here is an optimized function that helps out by calling mergeSegments
only on the newly added documents.  It will try to avoid calling
mergeSegments until the end, unless you're adding a lot of documents at
once.

If people are interested, I also have an extensive unit test that
verifies that this function works correctly.  I gave it a different name
because its performance characteristics are very different: the index is
left unoptimized, which can make querying take longer.

Feedback welcome,
-Steven

  /**
   * Merges all segments from an array of indexes into this index, without
   * optimizing the index.  It does this by renaming the files in the target
   * directories so that they have the appropriate segment names/numbers and
   * then merging them into the current index.
   *
   * <p>This may be used to parallelize batch indexing.  A large document
   * collection can be broken into sub-collections.  Each sub-collection can
   * be indexed in parallel, on a different thread, process or machine.  The
   * complete index can then be created by merging sub-collection indexes
   * with this method.
   *
   * <p>After this completes, the index is unoptimized, but the indexes
   * from each directory passed in will be merged into one segment before
   * adding to the main index.
   */
  public synchronized void addIndexesNoOpt(Directory[] dirs)
    throws IOException {

    // Documents currently in the index
    int curDocCount = docCount();
    // Documents added so far, not in the index
    int addedDocs = 0;
    // The position where segments from other directories are added
    int start = segmentInfos.size();

    for (int i = 0; i < dirs.length; i++) {
      SegmentInfos sis = new SegmentInfos();   // read infos from dir
      sis.read(dirs[i]);

      for (int j = 0; j < sis.size(); j++) {
        SegmentInfo info = sis.info(j);
        segmentInfos.addElement(info);      // add each info
        addedDocs += info.docCount;         // keep track of the size
      }

      // If we've increased the index size by 1/2, we should merge
      // segments now
      if (addedDocs * 2 > curDocCount && curDocCount > 0) {
        mergeSegments(start);
        curDocCount = docCount();
        addedDocs = 0;
        start = segmentInfos.size();
      }
    }

    // Merge in all segments not yet in the index.
    mergeSegments(start);

    // Make sure we're under the doc factor
    maybeMergeSegments();
  }
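
To make it easier to see when the intermediate merge fires, here is a
rough standalone sketch that replays the same bookkeeping on plain
integers.  It does not touch Lucene at all.  The class name
MergeTriggerSketch, the method mergePoints and the sample doc counts are
made up for illustration, and it assumes there are no deletions, so that
docCount() after a merge is simply the old count plus the added docs.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeTriggerSketch {

    /**
     * Returns the positions in docsPerDir after which the loop in
     * addIndexesNoOpt would call mergeSegments(start) early.
     * Hypothetical helper; it only mirrors the counting logic.
     */
    static List<Integer> mergePoints(int initialDocCount, int[] docsPerDir) {
        List<Integer> points = new ArrayList<>();
        int curDocCount = initialDocCount; // documents currently in the index
        int addedDocs = 0;                 // documents added but not yet merged

        for (int i = 0; i < docsPerDir.length; i++) {
            addedDocs += docsPerDir[i];

            // Same condition as in the method above.
            if (addedDocs * 2 > curDocCount && curDocCount > 0) {
                points.add(i);
                // Assumption: no deletions, so the merged index holds
                // the old documents plus everything added since.
                curDocCount += addedDocs;
                addedDocs = 0;
            }
        }
        return points;
    }

    public static void main(String[] args) {
        // 1000 docs in the main index, four small batches.
        System.out.println(mergePoints(1000, new int[] {100, 200, 300, 50}));
        // Empty main index: the early merge never fires.
        System.out.println(mergePoints(0, new int[] {100, 200}));
    }
}
```

With a 1000-document index and batches of 100, 200, 300 and 50
documents, the early merge fires only after the third batch, once the
pending 600 documents exceed half of the index.  With an empty main
index the condition never holds, so everything is left for the final
mergeSegments(start) call.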



