lucene-java-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Most efficient way to index 14M documents (out of memory/file handles)
Date Wed, 07 Jul 2004 17:15:31 GMT
Julien,

Thanks for the excellent explanation.

I think this thread points to a documentation problem.  We should 
improve the javadoc for these parameters to make it easier for folks to

In particular, the javadoc for mergeFactor should mention that very 
large values (>100) are not recommended, since they can run into file 
handle limitations with FSDirectory.  The maximum number of open files 
while merging is around mergeFactor * (5 + number of indexed fields). 
Perhaps mergeFactor should be tagged an "Expert" parameter to discourage 
folks from playing with it, as it is such a common source of problems.
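
As a rough illustration of that rule of thumb, here is a minimal sketch 
that only evaluates mergeFactor * (5 + number of indexed fields).  The 
class name, the method name and the field count of 10 are made up for 
the example; none of this is Lucene API, and the result is an estimate, 
not an exact count.

```java
public class MergeFileHandleEstimate {

    // Rule of thumb from this message: roughly
    // mergeFactor * (5 + number of indexed fields) files open while merging.
    static long estimateOpenFiles(int mergeFactor, int indexedFields) {
        return (long) mergeFactor * (5 + indexedFields);
    }

    public static void main(String[] args) {
        int indexedFields = 10; // hypothetical number of indexed fields
        int[] mergeFactors = {10, 100, 5000};
        for (int mergeFactor : mergeFactors) {
            System.out.println("mergeFactor=" + mergeFactor
                    + " -> about " + estimateOpenFiles(mergeFactor, indexedFields)
                    + " open files while merging");
        }
    }
}
```

With 10 indexed fields, a mergeFactor of 10 comes to about 150 files, 
while a mergeFactor of 5000 comes to about 75000, far more than a 
typical per-process file handle limit.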

The javadoc should instead encourage using minMergeDocs to increase 
indexing speed by using more memory.  This parameter is unfortunately 
poorly named.  It should really be called something like maxBufferedDocs.
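
To make the interplay of the two parameters concrete, here is a toy 
simulation.  It assumes a deliberately simplified model: documents are 
buffered in memory until minMergeDocs is reached, the buffer is then 
written as one on-disk segment, and whenever the last mergeFactor 
segments have the same size they are merged into one.  This is a sketch 
of the idea, not Lucene's actual merge code, and every name in it is 
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentMergeSketch {

    // Returns the peak number of on-disk segments seen while "indexing"
    // totalDocs documents under the simplified model described above.
    static int peakSegments(int totalDocs, int minMergeDocs, int mergeFactor) {
        List<Integer> segments = new ArrayList<>(); // doc count of each on-disk segment
        int buffered = 0; // documents currently held in memory
        int peak = 0;
        for (int i = 0; i < totalDocs; i++) {
            buffered++;
            if (buffered == minMergeDocs) {
                segments.add(buffered); // flush the in-memory buffer to disk
                buffered = 0;
                peak = Math.max(peak, segments.size());
                mergeTail(segments, mergeFactor);
            }
        }
        if (buffered > 0) { // flush whatever is left at the end
            segments.add(buffered);
            peak = Math.max(peak, segments.size());
        }
        return peak;
    }

    // While the last mergeFactor segments all have the same size,
    // replace them with a single merged segment.
    static void mergeTail(List<Integer> segments, int mergeFactor) {
        while (segments.size() >= mergeFactor) {
            int n = segments.size();
            int size = segments.get(n - 1);
            int total = 0;
            boolean sameSize = true;
            for (int j = n - mergeFactor; j < n; j++) {
                int s = segments.get(j);
                if (s != size) {
                    sameSize = false;
                    break;
                }
                total += s;
            }
            if (!sameSize) {
                return;
            }
            for (int j = 0; j < mergeFactor; j++) {
                segments.remove(segments.size() - 1);
            }
            segments.add(total);
        }
    }

    public static void main(String[] args) {
        System.out.println("minMergeDocs=1000, mergeFactor=10:   peak segments = "
                + peakSegments(100_000, 1000, 10));
        System.out.println("minMergeDocs=10,   mergeFactor=5000: peak segments = "
                + peakSegments(100_000, 10, 5000));
    }
}
```

In this model, 100,000 documents with minMergeDocs=1000 and 
mergeFactor=10 never leave more than 19 segments on disk, whereas 
minMergeDocs=10 with mergeFactor=5000 peaks at 5001 segments.  Each 
segment is made up of several files, which is where the file handles go.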

Doug

Julien Nioche wrote:
> It is not surprising that you run out of file handles with such a large
> mergeFactor.
> 
> Before trying more complex strategies involving RAMDirectories and/or
> splitting your indexing across several machines, I reckon you should try
> simple things like using a low mergeFactor (e.g. 10) combined with a higher
> minMergeDocs (e.g. 1000) and optimizing only at the end of the process.
> 
> By setting minMergeDocs to a higher value, you'll index and merge in a
> RAMDirectory. When the limit is reached (e.g. 1000), a segment is written to
> the file system. mergeFactor controls the number of segments to be merged, so
> when you have 10 segments on the file system (which is already 10x1000 docs),
> the IndexWriter will merge them all into a single segment. This is equivalent
> to an optimize, I think. The process continues like that until it's finished.
> 
> Combining these parameters should be enough to achieve good performance.
> The good point of using minMergeDocs is that you make heavy use of the
> RAMDirectory used internally by your IndexWriter (== fast) without having to
> be too careful with the RAM (which would be the case if you managed a
> RAMDirectory yourself). At the same time, keeping your mergeFactor low limits
> the risk of running out of file handles.
> 
> 
> ----- Original Message ----- 
> From: "Kevin A. Burton" <burton@newsmonster.org>
> To: "Lucene Users List" <lucene-user@jakarta.apache.org>
> Sent: Wednesday, July 07, 2004 7:44 AM
> Subject: Most efficient way to index 14M documents (out of memory/file
> handles)
> 
> 
> 
>>I'm trying to burn an index of 14M documents.
>>
>>I have two problems.
>>
>>1.  I have to run optimize() every 50k documents or I run out of file
>>handles.  This takes TIME and of course is linear in the size of the
>>index, so it just gets slower as indexing goes on.  It starts to crawl
>>at about 3M documents.
>>
>>2.  I eventually will run out of memory in this configuration.
>>
>>I KNOW this has been covered before but for the life of me I can't find
>>it in the archives, the FAQ or the wiki.
>>
>>I'm using an IndexWriter with a mergeFactor of 5k and then optimizing
>>every 50k documents.
>>
>>Does it make sense to just create a new IndexWriter for every 50k docs
>>and then do one big optimize() at the end?
>>
>>Kevin
>>
>>-- 
>>
>>Please reply using PGP.
>>
>>    http://peerfear.org/pubkey.asc
>>
>>    NewsMonster - http://www.newsmonster.org/
>>
>>Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
>>       AIM/YIM - sfburtonator,  Web - http://peerfear.org/
>>GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
>>  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>