lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "ImproveIndexingSpeed" by DanielNaber
Date Sun, 03 Feb 2008 22:18:25 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by DanielNaber:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

The comment on the change is:
update for lucene 2.3

------------------------------------------------------------------------------
  
   * '''Open a single writer and re-use it for the duration of your indexing session.'''
  
-  * '''Flush by RAM usage instead of document count.'''
+  * '''Lucene <= 2.2: Flush by RAM usage instead of document count.'''
  
   Call [http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#ramSizeInBytes() writer.ramSizeInBytes()] after every added doc, then call [http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#flush() flush()] when it's using too much RAM.  This is especially good if you have small docs or highly variable doc sizes.  You need to first set [http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxBufferedDocs(int) maxBufferedDocs] large enough to prevent the writer from flushing based on document count.  However, don't set it too large, otherwise you may hit [http://issues.apache.org/jira/browse/LUCENE-845 LUCENE-845].  Somewhere around 2-3X your "typical" flush count should be OK.
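
The check-after-every-add loop described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Lucene code: RamTrackedWriter and ToyWriter are hypothetical stand-ins that mimic only the ramSizeInBytes() and flush() calls named in the text, and the byte accounting is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the writer calls the tip relies on (not a Lucene type). */
interface RamTrackedWriter {
    void addDocument(String doc);
    long ramSizeInBytes();
    void flush();
}

/** Toy in-memory writer, present only so the sketch can run without Lucene. */
class ToyWriter implements RamTrackedWriter {
    private final List<String> buffer = new ArrayList<String>();
    private long bufferedBytes = 0;
    int flushCount = 0;

    public void addDocument(String doc) {
        buffer.add(doc);
        bufferedBytes += 2L * doc.length(); // assumption: 2 bytes per char
    }

    public long ramSizeInBytes() {
        return bufferedBytes;
    }

    public void flush() {
        buffer.clear();
        bufferedBytes = 0;
        flushCount++;
    }
}

class RamFlushSketch {
    /** Adds every doc, flushing whenever the buffered size reaches maxRamBytes. */
    static void indexAll(RamTrackedWriter writer, Iterable<String> docs, long maxRamBytes) {
        for (String doc : docs) {
            writer.addDocument(doc);
            // Check RAM usage after every added doc, as the tip suggests.
            if (writer.ramSizeInBytes() >= maxRamBytes) {
                writer.flush();
            }
        }
        // Flush whatever is still buffered at the end of the session.
        if (writer.ramSizeInBytes() > 0) {
            writer.flush();
        }
    }
}
```

With a real IndexWriter you would presumably also raise maxBufferedDocs first, as noted above, so that the document-count trigger does not fire before the RAM check does.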
  
@@ -34, +34 @@

  
   * '''Re-use Document and Field instances'''
  
-  As of Lucene 2.3 (not yet released) there are new setValue(...) methods that allow you
to change the value of a Field.  This allows you to re-use a single Field instance across
many added documents, which can save substantial GC cost.
+  As of Lucene 2.3 there are new setValue(...) methods that allow you to change the value
of a Field.  This allows you to re-use a single Field instance across many added documents,
which can save substantial GC cost.
  
   It's best to create a single Document instance, then add multiple Field instances to it, but hold onto these Field instances and re-use them by changing their values for each added document.  For example, you might have an idField, bodyField, nameField, storedField1, etc.  After the document is added, you then directly change the Field values (idField.setValue(...), etc.), and then re-add your Document instance.
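
The re-use pattern described above might look roughly like this. The classes below are hypothetical stand-ins written for illustration, so the sketch runs without Lucene; only the shape of the loop (allocate once, call setValue(...), re-add the same Document) is taken from the text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a field whose value can be replaced in place. */
class ReusableField {
    final String name;
    private String value;

    ReusableField(String name, String value) {
        this.name = name;
        this.value = value;
    }

    void setValue(String value) {
        this.value = value;
    }

    String stringValue() {
        return value;
    }
}

/** Hypothetical stand-in for a document: an ordered list of field references. */
class ReusableDocument {
    final List<ReusableField> fields = new ArrayList<ReusableField>();

    void add(ReusableField field) {
        fields.add(field);
    }
}

/** Hypothetical writer that snapshots the current field values on each add. */
class RecordingWriter {
    final List<String> added = new ArrayList<String>();

    void addDocument(ReusableDocument doc) {
        StringBuilder sb = new StringBuilder();
        for (ReusableField f : doc.fields) {
            if (sb.length() > 0) {
                sb.append(';');
            }
            sb.append(f.name).append('=').append(f.stringValue());
        }
        added.add(sb.toString());
    }
}

class FieldReuseSketch {
    /** Each row is {id, body}; one Document and two Fields serve all rows. */
    static void indexAll(RecordingWriter writer, String[][] rows) {
        // Allocate the document and its fields once, outside the loop.
        ReusableDocument doc = new ReusableDocument();
        ReusableField idField = new ReusableField("id", "");
        ReusableField bodyField = new ReusableField("body", "");
        doc.add(idField);
        doc.add(bodyField);

        for (String[] row : rows) {
            // Change the values in place, then re-add the same instance.
            idField.setValue(row[0]);
            bodyField.setValue(row[1]);
            writer.addDocument(doc);
        }
    }
}
```

The saving comes from the loop body allocating no new document or field objects per row. This only works if the writer has finished consuming the values by the time addDocument returns.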
  
@@ -48, +48 @@

  
   * '''Use the char[] API in Token instead of the String API to represent token Text'''
  
-  As of Lucene 2.3 (not yet released), a Token can represent its text as a slice into a char
array, which saves the GC cost of new'ing and then reclaiming String instances.  By re-using
a single Token instance and using the char[] API you can avoid new'ing any objects for each
term.  See [http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Token.html Token]
for details. 
+  As of Lucene 2.3, a Token can represent its text as a slice into a char array, which saves the GC cost of new'ing and then reclaiming String instances.  By re-using a single Token instance and using the char[] API you can avoid new'ing any objects for each term.  See [http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Token.html Token] for details.
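
To illustrate the idea of a token that carries its text as a char[] slice and is re-used across terms, here is a small self-contained sketch. CharSliceToken and WhitespaceSlicer are hypothetical classes, and their method names are illustrative assumptions rather than the actual Token API; consult the Token javadoc linked above for the real methods.

```java
/** Hypothetical token that holds its text in a reusable char[] plus a length. */
class CharSliceToken {
    private char[] buffer = new char[16];
    private int length = 0;

    /** Copies len chars from src into the internal buffer, growing it only if needed. */
    void setTermBuffer(char[] src, int offset, int len) {
        if (len > buffer.length) {
            buffer = new char[Math.max(len, buffer.length * 2)];
        }
        System.arraycopy(src, offset, buffer, 0, len);
        length = len;
    }

    char[] termBuffer() {
        return buffer;
    }

    int termLength() {
        return length;
    }
}

/** Hypothetical tokenizer that splits on spaces and fills a caller-supplied token. */
class WhitespaceSlicer {
    private final char[] text;
    private int pos = 0;

    WhitespaceSlicer(char[] text) {
        this.text = text;
    }

    /** Writes the next term into reusable; returns false when the input is exhausted. */
    boolean next(CharSliceToken reusable) {
        while (pos < text.length && text[pos] == ' ') {
            pos++; // skip separators
        }
        if (pos >= text.length) {
            return false;
        }
        int start = pos;
        while (pos < text.length && text[pos] != ' ') {
            pos++;
        }
        // No String is created here; the term is copied as chars into the token.
        reusable.setTermBuffer(text, start, pos - start);
        return true;
    }
}
```

A consumer would allocate one CharSliceToken, call next(token) in a loop, and read termBuffer() up to termLength() each time. In this sketch, once the buffer has grown to fit the longest term, the loop allocates nothing per term.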
  
  
   * '''Use autoCommit=false when you open your IndexWriter'''
  
-  In Lucene 2.3 (not yet released), there are substantial optimizations for Documents that
use stored fields and term vectors, to save merging of these very large index files.  You
should see the best gains by using autoCommit=false for a single long-running session of IndexWriter.
 Note however that searchers will not see any of the changes flushed by this IndexWriter until
it is closed; if that is important you should stick with autoCommit=true instead or periodically
close and re-open the writer.
+  In Lucene 2.3 there are substantial optimizations for Documents that use stored fields
and term vectors, to save merging of these very large index files.  You should see the best
gains by using autoCommit=false for a single long-running session of IndexWriter.  Note however
that searchers will not see any of the changes flushed by this IndexWriter until it is closed;
if that is important you should stick with autoCommit=true instead or periodically close and
re-open the writer.
  
   * '''Instead of indexing many small text fields, aggregate the text into a single "contents"
field and index only that (you can still store the other fields).'''
  
