lucene-dev mailing list archives

Subject [Jakarta Lucene Wiki] New: PainlessIndexing
Date Tue, 20 Jul 2004 21:39:16 GMT
   Date: 2004-07-20T14:39:16
   Editor: JulienNioche <>
   Wiki: Jakarta Lucene Wiki
   Page: PainlessIndexing

   hint for indexing with lucene

New Page:

IndexWriter has a useful method called (at least temporarily) '''setMinMergeDocs'''
that should be used to avoid file handle problems and reduce
indexing time.

File handle problems often arise because people use large '''mergeFactor'''
values to speed up indexing. The maximum number of open files while merging is
around mergeFactor * (5 + number of indexed fields),
which can be too many for an index stored in an FSDirectory (for example, more than the process is allowed to open).
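
To get a feel for the estimate above, here is a minimal sketch that only evaluates the rule of thumb mergeFactor * (5 + number of indexed fields). It does not call Lucene; the class name, the method name and the field count of 8 are assumptions made for illustration.

```java
/** Back-of-the-envelope helper, not part of Lucene. */
final class OpenFileEstimate {
    private OpenFileEstimate() {}

    /** Rule of thumb from the text: mergeFactor * (5 + number of indexed fields). */
    static int maxOpenFiles(int mergeFactor, int indexedFields) {
        if (mergeFactor < 1 || indexedFields < 0) {
            throw new IllegalArgumentException("mergeFactor must be >= 1 and indexedFields >= 0");
        }
        return mergeFactor * (5 + indexedFields);
    }

    public static void main(String[] args) {
        int indexedFields = 8; // hypothetical number of indexed fields
        for (int mergeFactor : new int[] {10, 50, 100}) {
            System.out.println("mergeFactor=" + mergeFactor
                    + " -> about " + maxOpenFiles(mergeFactor, indexedFields) + " open files");
        }
    }
}
```

With 8 indexed fields, the estimate grows from about 130 open files at a mergeFactor of 10 to about 1300 at a mergeFactor of 100, which may be more than a process is allowed to open on some systems.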

By setting '''minMergeDocs''' to a higher value, you index and merge in a
RAMDirectory that the IndexWriter uses internally. When the limit set by '''minMergeDocs'''
is reached (e.g. 1000), a segment is written to
the file system. '''mergeFactor''' controls the number of segments to be merged, so with
a '''mergeFactor''' of 10, once you have 10 segments on the file system (already 10 x 1000 docs), the
IndexWriter merges them all into a single segment. I think this is equivalent to
an optimize. The process continues like this until indexing is finished.
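
The flush-and-merge cycle described above can be illustrated with a toy model. This is a simplified sketch, not Lucene's actual merge code: it assumes documents are buffered until '''minMergeDocs''' is reached, that a segment of that size is then written, and that whenever the last mergeFactor segments have the same size they are merged into one. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of buffering and merging; it only tracks segment sizes. */
final class MergeSimulation {
    private MergeSimulation() {}

    /** Returns the sizes (in documents) of the on-disk segments after indexing totalDocs. */
    static List<Integer> simulate(int totalDocs, int minMergeDocs, int mergeFactor) {
        if (minMergeDocs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("minMergeDocs must be >= 1 and mergeFactor >= 2");
        }
        List<Integer> segments = new ArrayList<>();
        int buffered = 0; // documents currently held in RAM
        for (int i = 0; i < totalDocs; i++) {
            buffered++;
            if (buffered == minMergeDocs) {
                segments.add(buffered); // write one segment to disk
                buffered = 0;
                mergeEqualSegments(segments, mergeFactor);
            }
        }
        if (buffered > 0) {
            segments.add(buffered); // remaining documents are written when indexing ends
        }
        return segments;
    }

    /** While the last mergeFactor segments have equal size, replace them by one merged segment. */
    private static void mergeEqualSegments(List<Integer> segments, int mergeFactor) {
        while (segments.size() >= mergeFactor) {
            int n = segments.size();
            int size = segments.get(n - 1);
            int sum = 0;
            boolean sameSize = true;
            for (int j = n - mergeFactor; j < n; j++) {
                if (segments.get(j) != size) {
                    sameSize = false;
                    break;
                }
                sum += size;
            }
            if (!sameSize) {
                return;
            }
            for (int j = 0; j < mergeFactor; j++) {
                segments.remove(segments.size() - 1); // drop the merged inputs
            }
            segments.add(sum);
        }
    }

    public static void main(String[] args) {
        // 25,000 documents, minMergeDocs = 1000, mergeFactor = 10
        System.out.println(simulate(25000, 1000, 10));
    }
}
```

In this model, 25,000 documents with a '''minMergeDocs''' of 1000 and a '''mergeFactor''' of 10 end up as two segments of 10,000 documents plus five segments of 1000, so only a handful of segment files exist on disk at any time.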

Combining these parameters should be enough to achieve good performance.
The advantage of '''minMergeDocs''' is that you make heavy use of the
RAMDirectory inside your IndexWriter (== fast) without having to be too
careful with RAM (which would be the case if you used a RAMDirectory directly). At the
same time, keeping your mergeFactor low limits the risk of running out of file handles.

<hint given by JulienNioche>
