lucene-java-user mailing list archives

From Doug Cutting
Subject Re: Fastest batch indexing with 1.3-rc1
Date Thu, 21 Aug 2003 04:41:03 GMT
Leo Galambos wrote:
> Isn't it better for Dan to skip the optimization phase before merging? I 
> am not sure, but he could save some time on this (if he has enough file 
> handles for that, of course).

It depends.  If you have 10 machines, each with a single disk, that you 
use for indexing in parallel, and copy all of the indexes to a single 
machine for the final merge, then you're probably better off optimizing 
each index before copying it and merging it with the others, in order to 
maximize the amount of work done in parallel, using all disk spindles. 
However, if instead you have one machine with ten processors and a 
filesystem striped across ten disks, then, in theory, optimizing before 
merging might not help much, since the single-threaded final merge could 
use all ten disks at once.  Even then, though, the final merge would be 
doing some CPU work serially that would have been done in parallel in 
the first configuration.  In general I think it's best to do as much 
work as possible in parallel.
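
To make the trade-off concrete, here is a toy model in plain Java. It does not use the Lucene API. As an assumption for illustration only, each "index" is a list of sorted runs standing in for segments, and "optimizing" means merging one index's runs into a single run. The class and method names (ParallelOptimizeSketch, mergeRuns, optimizeThenMerge, mergeWithoutOptimizing) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelOptimizeSketch {

    // Toy stand-in for a segment merge: k-way merge of sorted runs into one sorted run.
    public static List<Integer> mergeRuns(List<List<Integer>> runs) {
        // Queue entries are {value, run index, position in run}, ordered by value.
        PriorityQueue<int[]> heads =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heads.add(new int[] {runs.get(r).get(0), r, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            merged.add(head[0]);
            List<Integer> run = runs.get(head[1]);
            int next = head[2] + 1;
            if (next < run.size()) {
                heads.add(new int[] {run.get(next), head[1], next});
            }
        }
        return merged;
    }

    // First configuration: "optimize" each index in its own task (in parallel),
    // then run a single serial merge over one run per index.
    public static List<Integer> optimizeThenMerge(List<List<List<Integer>>> indexes) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, indexes.size()));
        try {
            List<Future<List<Integer>>> pending = new ArrayList<>();
            for (List<List<Integer>> index : indexes) {
                Callable<List<Integer>> optimize = () -> mergeRuns(index);
                pending.add(pool.submit(optimize));
            }
            List<List<Integer>> optimized = new ArrayList<>();
            for (Future<List<Integer>> result : pending) {
                optimized.add(result.get());
            }
            return mergeRuns(optimized);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel optimize failed", e);
        } finally {
            pool.shutdown();
        }
    }

    // Second configuration: skip optimizing, so the serial merge sees every run.
    public static List<Integer> mergeWithoutOptimizing(List<List<List<Integer>>> indexes) {
        List<List<Integer>> allRuns = new ArrayList<>();
        for (List<List<Integer>> index : indexes) {
            allRuns.addAll(index);
        }
        return mergeRuns(allRuns);
    }
}
```

Both paths produce the same merged result.  The difference is where the work happens: in the first, most of the merging is spread across the worker tasks and the serial step handles only one run per index; in the second, all of it lands on the single serial merge.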

 > What strategy do you use in "nutch"?

Nutch builds optimized indexes for each fetched "segment" (n.b., a Nutch 
segment is different from a Lucene segment) and only merges segment 
indexes as the final step before deploying them for searching.  Nutch 
has a rolling set of active segments: the oldest are periodically 
discarded and replaced with newly fetched segments.  Before a new set of 
segments is deployed, duplicate elimination processing must occur, which 
marks duplicates as deleted prior to merging new production indexes.
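
The rolling-set workflow described above can be sketched roughly as follows. This is a toy model in plain Java, not Nutch code. Several things here are assumptions for illustration: documents are identified by a string key, a duplicate is any repeated key, the newest copy is the one kept, and the names (RollingSegmentsSketch, addSegment, buildProductionIndex) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class RollingSegmentsSketch {

    // One document in a fetched segment; "deleted" mimics a deletion mark.
    private static final class Doc {
        final String key;
        boolean deleted;
        Doc(String key) { this.key = key; }
    }

    private final int maxActive;
    private final Deque<List<Doc>> active = new ArrayDeque<>(); // oldest first

    public RollingSegmentsSketch(int maxActive) {
        this.maxActive = maxActive;
    }

    // Add a newly fetched segment and discard the oldest ones beyond the limit.
    public void addSegment(List<String> keys) {
        List<Doc> segment = new ArrayList<>();
        for (String key : keys) {
            segment.add(new Doc(key));
        }
        active.addLast(segment);
        while (active.size() > maxActive) {
            active.removeFirst();
        }
    }

    // Mark duplicates as deleted, then merge the surviving documents.
    public List<String> buildProductionIndex() {
        // Walk newest to oldest so that the newest copy of each key is kept
        // (an assumption; the message does not say which copy survives).
        Set<String> seen = new HashSet<>();
        Iterator<List<Doc>> newestFirst = active.descendingIterator();
        while (newestFirst.hasNext()) {
            for (Doc doc : newestFirst.next()) {
                if (!seen.add(doc.key)) {
                    doc.deleted = true;
                }
            }
        }
        // Merge oldest to newest, skipping documents marked as deleted.
        List<String> merged = new ArrayList<>();
        for (List<Doc> segment : active) {
            for (Doc doc : segment) {
                if (!doc.deleted) {
                    merged.add(doc.key);
                }
            }
        }
        return merged;
    }
}
```

With a limit of two active segments, adding segments {a, b}, {b, c} and {c, d} in that order discards the first one, marks the older copy of c as deleted, and merges the rest.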

