lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ryan Clifton" <rclif...@cj.com>
Subject RE: Indexing very large sets (10 million docs)
Date Mon, 28 Jul 2003 21:11:34 GMT
Doug,

You seem to by implying that it is possible to optimize very large indexes.  My index has
a couple million records, but more importantly it's about 40 gigs in size.  I have tried many
times to optimize it and this always results in hitting the Linux file size limit.  Is there
 a way to get around this?  I have the merge factor and max merge docs set, but the optimization
process seems to ignore those fields.  

I have just been accepting the slower (unoptimized) search times as a fact of life..

-----Original Message-----
From: Doug Cutting [mailto:cutting@lucene.com]
Sent: Monday, July 28, 2003 10:28 AM
To: Lucene Users List
Subject: Re: Indexing very large sets (10 million docs)


Armbrust, Daniel C. wrote:
> If you set your mergeFactor back down to something closer to the default (10) - you probably
wouldn't have any problems with file handles.  The higher you make it, the more open files
you will have.  When I set it at 90 for performance reasons, I would run out of file handles
(on XP) somewhere after 100,000 documents.  So, I simply create a new index every 100,000
documents.  This way, I get the best of both worlds.  Performance while indexing from a relatively
high mergefactor, and non-excessive file handle usage (without calling optimize), from closing
the index before it gets huge.  I am never recopying the index over itself until the last
step, when I merge all of the indexes into one master index.  

I have used a mergeFactor of 50 with three indexed fields to build 
indexes with in excess of 5 million documents without incident and with 
good performance.

The maximum number of open files while indexing is:
   (7 + indexedFields) * mergeFactor
So with a mergeFactor of 50 and three indexed fields, indexing should 
never open more than 500 files at once, which is well beneath the 
default Linux limit.

For batched indexing I recommend: (1) increasing mergeFactor somewhat, 
depending on how many indexed fields you have; (2) adding all of your 
documents; and (3) optimizing once at the end.

In my opinion, and by design, intermediate optimizations are a hassle 
that provide little advantage, as Lucene does this automatically for 
you.  Lots of folks on this list have recommended complex strategies 
combining very large mergeFactors with explicit intermediate index 
optimization, but I encourage you to avoid these techniques and use the 
simple strategy described here.

It is important to optimize large indexes: it makes searches much 
faster, and vastly reduces the number of file handles required when 
searching.  But, as other posters have noted, optimization is expensive, 
thus indexes should only be optimized when indexing is complete. 
However for rapidly changing collections, deciding when to optimize is 
more tricky and application dependent...

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message