lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "ImproveIndexingSpeed" by MikeMcCandless
Date Thu, 19 Jul 2007 10:14:59 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by MikeMcCandless:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

The comment on the change is:
Adding further optimizations enabled by LUCENE-843

------------------------------------------------------------------------------
  
   More RAM before flushing means Lucene writes larger segments to begin with, which means
less merging later.  Testing in [http://issues.apache.org/jira/browse/LUCENE-843 LUCENE-843]
found that around 48 MB is the sweet spot for that content set, but your application could
have a different sweet spot.
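 As a sketch (assuming the setRAMBufferSizeMB API added to IndexWriter by LUCENE-843; the index path is hypothetical), flushing by RAM usage rather than by document count looks like:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class RamBufferExample {
  public static void main(String[] args) throws Exception {
    // Create a writer; true = create a new index at this path.
    IndexWriter writer = new IndexWriter("/path/to/index",
                                         new StandardAnalyzer(), true);
    // Flush a segment once ~48 MB of documents are buffered in RAM.
    // 48 MB was the LUCENE-843 sweet spot for that content set; measure your own.
    writer.setRAMBufferSizeMB(48.0);
    // ... add documents ...
    writer.close();
  }
}
```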
  
-  * '''Increase [http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMergeFactor(int)
mergeFactor], but not too much.'''
- 
-  A larger [http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMergeFactor(int)
mergeFactor] defers merging of segments until later, speeding up indexing because merging
is a large part of indexing. However, this will slow down searching, and you will run out
of file descriptors if you make it too large.  Values that are too large may even slow down
indexing, since merging more segments at once means much more seeking on the hard drive.
- 
   * '''Turn off compound file format.'''
  
   Call [http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setUseCompoundFile(boolean)
setUseCompoundFile(false)]. Building the compound file format takes time during indexing (7-33%
in testing for [http://issues.apache.org/jira/browse/LUCENE-888 LUCENE-888]).  However, note
that doing this will greatly increase the number of file descriptors used by indexing and
by searching, so you could run out of file descriptors if mergeFactor is also large.
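 A minimal sketch of this setting (the index path is hypothetical; the method is the one linked in the IndexWriter javadocs above):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class NoCompoundFileExample {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter("/path/to/index",
                                         new StandardAnalyzer(), true);
    // Keep separate per-segment files instead of packing them into a
    // compound (.cfs) file: faster indexing, but many more open file
    // descriptors, especially when combined with a large mergeFactor.
    writer.setUseCompoundFile(false);
    writer.close();
  }
}
```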
  
+  * '''Re-use Document and Field instances'''
+ 
+  As of Lucene 2.3 (not yet released) there are new setValue(...) methods that allow you
to change the value of a Field.  This allows you to re-use a single Field instance across
many added documents, which can save substantial GC cost.
+ 
+  It's best to create a single Document instance, then add multiple Field instances to it,
but hold onto these Field instances and re-use them by changing their values for each added
document.  For example, you might have an idField, bodyField, nameField, storedField1, etc.
After the document is added, you directly change the Field values (idField.setValue(...),
etc.) and then re-add your Document instance.
+ 
+  Note that you cannot re-use a single Field instance within a Document, and, you should
not change a Field's value until the Document containing that Field has been added to the
index.  See [http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html
Field] for details.
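 A sketch of this pattern, assuming the Lucene 2.3 setValue(...) API; the SourceRecord type, the records collection, and the field names are hypothetical, and writer is an already-open IndexWriter:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Build the Document and its Fields once, up front.
Document doc = new Document();
Field idField = new Field("id", "", Field.Store.YES, Field.Index.UN_TOKENIZED);
Field bodyField = new Field("body", "", Field.Store.NO, Field.Index.TOKENIZED);
doc.add(idField);
doc.add(bodyField);

for (SourceRecord rec : records) {     // hypothetical source of content
  // Change values only after the previous addDocument call has returned.
  idField.setValue(rec.id);
  bodyField.setValue(rec.body);
  writer.addDocument(doc);             // re-add the same Document instance
}
```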
+ 
+ 
+  * '''Re-use a single Token instance in your analyzer'''
+ 
+  Analyzers often create a new Token for each term in sequence that needs to be indexed from
a Field.  You can save substantial GC cost by re-using a single Token instance instead.
+ 
+ 
+  * '''Use the char[] API in Token instead of the String API to represent token Text'''
+ 
+  As of Lucene 2.3 (not yet released), a Token can represent its text as a slice into a char
array, which saves the GC cost of new'ing and then reclaiming String instances.  By re-using
a single Token instance and using the char[] API you can avoid new'ing any objects for each
term.  See [http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Token.html Token]
for details. 
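 The two Token tips above can be sketched together, assuming the Lucene 2.3 reuse APIs (clear(), setTermBuffer(char[], int, int), and the next(Token) variant of TokenStream); the stream class itself is a hypothetical single-term example:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical stream producing one term, illustrating Token re-use:
// the caller's Token is cleared and refilled via the char[] API instead
// of allocating a new Token (and a new String) per term.
public class SingleTermStream extends TokenStream {
  private final String term;
  private boolean done = false;

  public SingleTermStream(String term) { this.term = term; }

  public Token next(Token reusable) throws IOException {
    if (done) return null;
    done = true;
    reusable.clear();                       // reset the reusable instance
    reusable.setTermBuffer(term.toCharArray(), 0, term.length());
    reusable.setStartOffset(0);
    reusable.setEndOffset(term.length());
    return reusable;
  }
}
```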
+ 
+ 
+  * '''Use autoCommit=false when you open your IndexWriter'''
+ 
+  In Lucene 2.3 (not yet released), there are substantial optimizations for Documents that
use stored fields and term vectors, which avoid merging these very large index files.  You
should see the best gains by using autoCommit=false for a single long-running session of IndexWriter.
 Note however that searchers will not see any of the changes flushed by this IndexWriter until
it is closed; if that is important, you should stick with autoCommit=true instead, or periodically
close and re-open the writer.
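 A sketch, assuming the IndexWriter constructor that takes an autoCommit flag (the index path is hypothetical):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class AutoCommitFalseExample {
  public static void main(String[] args) throws Exception {
    FSDirectory dir = FSDirectory.getDirectory("/path/to/index");
    // autoCommit=false: nothing this writer flushes becomes visible to
    // searchers until close() is called.
    IndexWriter writer = new IndexWriter(dir, false, new StandardAnalyzer());
    // ... one long-running indexing session ...
    writer.close();
  }
}
```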
  
   * '''Instead of indexing many small text fields, aggregate the text into a single "contents"
field and index only that (you can still store the other fields).'''
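 For example (the field names and source strings are hypothetical): the text is concatenated into one indexed "contents" field, while the originals are stored without being indexed:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
// One aggregated, indexed field instead of many small indexed fields.
String contents = title + " " + author + " " + body;
doc.add(new Field("contents", contents, Field.Store.NO, Field.Index.TOKENIZED));
// The originals can still be stored for retrieval, unindexed.
doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));
doc.add(new Field("author", author, Field.Store.YES, Field.Index.NO));
```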
+ 
+  * '''Increase [http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMergeFactor(int)
mergeFactor], but not too much.'''
+ 
+  A larger [http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMergeFactor(int)
mergeFactor] defers merging of segments until later, speeding up indexing because merging
is a large part of indexing. However, this will slow down searching, and you will run out
of file descriptors if you make it too large.  Values that are too large may even slow down
indexing, since merging more segments at once means much more seeking on the hard drive.
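 For example (writer is an already-open IndexWriter; the default mergeFactor is 10):

```java
// Defer merging by merging more segments at once; values that are too
// large risk file-descriptor exhaustion and extra disk seeking.
writer.setMergeFactor(20);
```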
  
   * '''Turn off any features you are not in fact using.'''
  
