lucene-java-user mailing list archives

From Michael McCandless
Subject Re: Appropriate disk optimization for large index?
Date Mon, 18 Aug 2008 20:54:09 GMT

mattspitz wrote:

> So, my indexing is done in "rounds", where I pull a bunch of documents from
> the database, index them, and flush them to disk.  I manually call "flush()"
> because I need to ensure that what's on disk is accurate with what I've
> pulled from the database.
> On each round, then, I flush to disk.  I set the buffer such that it doesn't
> flush any segments until I manually call flush(), so as to incur I/O only
> once each "round"

Once you upgrade to 2.4 (or trunk), make sure you switch from flush()
to commit().  flush() doesn't sync the index files, so if the hardware
or OS crashes, your index may not match what's in the DB (and/or may
become corrupt).
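
To make "sync" concrete, here is a minimal file-level sketch using plain java.nio.  It is not Lucene's code, and the class, method and file names are made up for illustration.  It only shows the difference between handing bytes to the OS (which may keep them in its cache for a while) and additionally asking the OS to force them to the storage device, which is the kind of guarantee you want before treating a "round" as durable.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; not how Lucene implements flush() or commit().
public class SyncSketch {

    /** Writes data and returns once the OS has accepted it; it may still sit in the OS cache. */
    static void writeOnly(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
        }
    }

    /** Writes data, then asks the OS to push content and metadata to the storage device. */
    static void writeAndSync(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // the "sync" step that writeOnly() skips
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sync-sketch");
        Path file = dir.resolve("round-0001.dat"); // hypothetical file name
        writeAndSync(file, "docs for one round".getBytes(StandardCharsets.UTF_8));
        System.out.println("bytes on disk: " + Files.size(file));
    }
}
```

Both methods leave a file of the same size behind; the difference only shows up if the machine goes down shortly after the call returns.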

I'm not sure whether seek time or throughput is the better thing to
optimize in your IO system.  When flushing a segment you'd likely want
the fastest throughput, assuming the filesystem is able to assign many
adjacent blocks to the files being flushed.  During merging (and
optimize) I think seek time is most important, because Lucene reads
from 50 (your mergeFactor) files at once and then writes to one or two
files.  But merging (at least normal merging) is typically done
concurrently with adding documents, so the time it consumes may not
matter in the net runtime of the overall indexing process.  When a
flush happens during a merge, seek time is likely most important.
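
A rough back-of-envelope sketch of why seeks can dominate a merge follows.  Everything in it is an assumption for illustration: the cost model (every read-buffer refill costs one seek, because reads from many input files are interleaved), and the disk and buffer numbers in main(), which are not measurements of Lucene or of any particular drive.  The number of input files does not appear in the formula; it matters only because it forces the interleaving.

```java
// Hypothetical cost model; all numbers are made up, not measured.
public class MergeIoEstimate {

    /** Seconds spent seeking, assuming each buffer refill on an input file costs one seek. */
    static double seekSeconds(long totalBytes, long readBufferBytes, double seekMillis) {
        long refills = (totalBytes + readBufferBytes - 1) / readBufferBytes; // ceiling division
        return refills * seekMillis / 1000.0;
    }

    /** Seconds spent transferring totalBytes at a sustained rate given in MiB per second. */
    static double transferSeconds(long totalBytes, double mebibytesPerSecond) {
        return totalBytes / (mebibytesPerSecond * 1024 * 1024);
    }

    public static void main(String[] args) {
        long totalBytes = 1L << 30;     // assumed: 1 GiB read across all segments being merged
        long readBuffer = 64L * 1024;   // assumed: 64 KiB read per refill
        double seekMillis = 8.0;        // assumed: average seek time of the disk
        double rate = 64.0;             // assumed: sustained throughput in MiB/s

        System.out.println("seek seconds:     " + seekSeconds(totalBytes, readBuffer, seekMillis));
        System.out.println("transfer seconds: " + transferSeconds(totalBytes, rate));
    }
}
```

Under these assumed numbers the seek term is several times larger than the transfer term.  A larger read buffer or a faster-seeking device shrinks the first term, while higher throughput only shrinks the second.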

