lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "ImproveIndexingSpeed" by MikeMcCandless
Date Sat, 09 Jun 2007 16:25:44 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by MikeMcCandless:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

------------------------------------------------------------------------------
  
   * '''Flush by RAM usage instead of document count.'''
  
-  Call writer.ramSizeInBytes() after every added doc then call flush() when it's using too
much RAM.  This is especially good if you have small docs or  highly variable doc sizes. 
You need to first set maxBufferedDocs large enough to prevent the writer from flushing based
on document count.  However, don't set it too large otherwise you may hit [http://issues.apache.org/jira/browse/LUCENE-845
LUCENE-845].  Somewhere around 2-3X your "typical" flush count should be OK.
+  Call [http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#ramSizeInBytes()
writer.ramSizeInBytes()] after every added doc, then call [http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#flush()
flush()] when it's using too much RAM.  This is especially good if you have small docs or
 highly variable doc sizes.  You need to first set [http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxBufferedDocs(int)
maxBufferedDocs] large enough to prevent the writer from flushing based on document count.
 However, don't set it too large, or you may hit [http://issues.apache.org/jira/browse/LUCENE-845
LUCENE-845].  Somewhere around 2-3X your "typical" flush count should be OK.
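  Concretely, the pattern might look like this (a sketch against the Lucene 2.x IndexWriter API; the index path, the 48 MB budget, and the makeDocuments() helper are hypothetical placeholders):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch: flush by RAM usage instead of document count.
Directory dir = FSDirectory.getDirectory("/path/to/index");   // placeholder path
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
writer.setMaxBufferedDocs(1000000);   // large enough that doc count never triggers a flush
long ramBudget = 48 * 1024 * 1024;    // illustrative 48 MB budget; tune to your heap

for (Document doc : makeDocuments()) {  // makeDocuments() is a hypothetical doc source
    writer.addDocument(doc);
    if (writer.ramSizeInBytes() > ramBudget) {
        writer.flush();               // flush by RAM, not by doc count
    }
}
writer.close();
```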
  
   * '''Use as much RAM as you can afford.'''
  
@@ -17, +17 @@

  
   * '''Increase mergeFactor, but not too much.'''
  
-  Larger mergeFactors defer merging of segments until later, thus speeding up indexing because
merging is a large part of indexing. However, this will slow down searching, and you will
run out of file descriptors if you make it too large.  Values that are too large may even
slow down indexing, since merging more segments at once means much more seeking for the hard
drives.
+  Larger [http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMergeFactor(int)
mergeFactors] defer merging of segments until later, thus speeding up indexing because merging
is a large part of indexing. However, this will slow down searching, and you will run out
of file descriptors if you make it too large.  Values that are too large may even slow down
indexing, since merging more segments at once means much more seeking for the hard drives.
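  For example (a sketch; 10 is Lucene's default mergeFactor, and 30 is just an illustrative moderate increase):

```java
// Sketch: raise mergeFactor moderately. The default is 10; very large values
// risk descriptor exhaustion and extra disk seeking during big merges.
writer.setMergeFactor(30);
```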
  
   * '''Turn off compound file format.'''
  
-  Building the compound file format takes time during indexing (7-33% in testing for [http://issues.apache.org/jira/browse/LUCENE-888
LUCENE-888]).  However, note that doing this will greatly increase the number of file descriptors
used by indexing and by searching, so you could run out of file descriptors if mergeFactor
is also large.
+  Call [http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setUseCompoundFile(boolean)
setUseCompoundFile(false)]. Building the compound file format takes time during indexing (7-33%
in testing for [http://issues.apache.org/jira/browse/LUCENE-888 LUCENE-888]).  However, note
that doing this will greatly increase the number of file descriptors used by indexing and
by searching, so you could run out of file descriptors if mergeFactor is also large.
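  The call is a one-liner (Lucene 2.x API assumed):

```java
// Sketch: disable compound files. Expect many more files per segment, so keep
// mergeFactor modest or raise the OS file-descriptor limit accordingly.
writer.setUseCompoundFile(false);
```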
  
  
   * '''Instead of indexing many small text fields, aggregate the text into a single "contents"
field and index only that (you can still store the other fields).'''
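  One way to sketch the aggregation (Lucene 2.x Field API assumed; the field names and values are placeholders, and writer is an IndexWriter created earlier):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: store the small fields unindexed, and index only one aggregated field.
Document doc = new Document();
String title = "placeholder title";   // hypothetical values
String body = "placeholder body";
doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));
doc.add(new Field("body", body, Field.Store.YES, Field.Index.NO));
doc.add(new Field("contents", title + " " + body,
                  Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(doc);
```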
@@ -32, +32 @@

  
   * '''Use a faster analyzer.'''
  
-  Sometimes analysis of a document takes a lot of time. For example, StandardAnalyzer is quite
time consuming.  If you can get by with a simpler analyzer, then try it.
+  Sometimes analysis of a document takes a lot of time. For example, !StandardAnalyzer is
quite time consuming.  If you can get by with a simpler analyzer, then try it.
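  For instance, if simple whitespace tokenization is enough for your content, you might swap analyzers like this (a sketch; dir is a Directory created earlier):

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexWriter;

// Sketch: WhitespaceAnalyzer splits only on whitespace, skipping
// StandardAnalyzer's more expensive grammar-based tokenization.
IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
```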
  
   * '''Speed up document construction.'''
  
@@ -40, +40 @@

  
   * '''Don't optimize unless you really need to (for faster searching).'''
  
-  * '''Use multiple threads with one IndexWriter.'''
+  * '''Use multiple threads with one !IndexWriter.'''
  
   Modern hardware is highly concurrent (multi-core CPUs, multi-channel memory architectures,
native command queuing in hard drives, etc.) so using more than one thread to add documents
can give good gains overall.  Even on older machines there is often still concurrency to be
gained between IO and CPU.  Test the number of threads to find the best performance point.
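  A sketch of the pattern with java.util.concurrent (the pool size and makeDocuments() helper are hypothetical; IndexWriter.addDocument may be called concurrently from multiple threads):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.document.Document;

// Sketch: several worker threads feeding one shared IndexWriter.
ExecutorService pool = Executors.newFixedThreadPool(4);  // 4 is a placeholder; measure
for (final Document doc : makeDocuments()) {             // hypothetical doc source
    pool.execute(new Runnable() {
        public void run() {
            try {
                writer.addDocument(doc);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);
writer.close();
```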
  
