lucene-java-user mailing list archives

From: Harald Kirsch <kir...@ebi.ac.uk>
Subject: Re: Most efficient way to index 14M documents (out of memory/file handles)
Date: Wed, 07 Jul 2004 07:33:21 GMT
On Tue, Jul 06, 2004 at 10:44:40PM -0700, Kevin A. Burton wrote:
> I'm trying to burn an index of 14M documents.
> 
> I have two problems.
> 
> 1.  I have to run optimize() every 50k documents or I run out of file 
> handles. This takes TIME and is of course linear in the size of the 
> index, so it just gets slower the further I get. It starts to crawl 
> at about 3M documents.

Recently I indexed roughly this many documents. I first split the whole
thing into 100 jobs (we happen to have that many machines in the
cluster :-), each indexing its share into its own index. I used
mergeFactor=100 and optimized only once, just before closing the index.
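
Roughly, each job did something like the following (a minimal sketch
against the Lucene 1.4-era API; the field names and example values are
made up for illustration, not taken from our actual code):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class ShardIndexer {
    public static void main(String[] args) throws Exception {
      // args[0] names this job's own index directory.
      IndexWriter writer =
          new IndexWriter(args[0], new StandardAnalyzer(), true);
      writer.mergeFactor = 100;   // fewer, larger merges while indexing

      // ... loop over this job's share of the documents:
      Document doc = new Document();
      doc.add(Field.Keyword("id", "example-1"));        // hypothetical field
      doc.add(Field.Text("title", "an example title")); // hypothetical field
      doc.add(Field.Text("body", "some example text")); // hypothetical field
      writer.addDocument(doc);

      writer.optimize();          // only once, just before closing
      writer.close();
    }
  }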

Then I merged them all into one index simply by

  writer.mergeFactor = 150;  // larger merge factor for the one-off final merge
  writer.addIndexes(dirs);   // dirs holds the 100 per-job index directories

I was myself surprised that it went through easily, in under two hours
for each of the 101 indexes (the 100 shards plus the final merge). The
documents have, however, only three fields.
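
In full, the merge step looked roughly like this (a sketch assuming the
shard indexes sit under directories named shard-0 ... shard-99; those
paths are my invention here, the real ones differed):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class MergeShards {
    public static void main(String[] args) throws Exception {
      // Collect the existing shard indexes (create=false).
      Directory[] dirs = new Directory[100];
      for (int i = 0; i < dirs.length; i++)
        dirs[i] = FSDirectory.getDirectory("shard-" + i, false);

      IndexWriter writer =
          new IndexWriter("merged-index", new StandardAnalyzer(), true);
      writer.mergeFactor = 150;
      writer.addIndexes(dirs);  // merges all shards into this one index
      writer.close();
    }
  }

Note that in this API addIndexes() leaves the target index optimized,
so no separate optimize() call is needed afterwards.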

  Maybe this helps,
  Harald.

-- 
------------------------------------------------------------------------
Harald Kirsch | kirsch@ebi.ac.uk | +44 (0) 1223/49-2593


