lucene-java-user mailing list archives

From Doug Cutting
Subject Re: Indexing Tips and Hints
Date Tue, 25 Feb 2003 17:29:29 GMT
These sorts of tricks can help somewhat if index i/o is really your 
bottleneck.  Are you convinced that it is?  When i/o is the bottleneck, the 
CPU typically spends a large portion of its time idle.  Do you see this?

From your description (indexing ~300k documents of about 5k each takes 
over 24 hours) I would be very surprised if index i/o is your bottleneck.  
Rather, I would suspect the XML parsing or somesuch.
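
One way to check where the time goes before tuning the index is to time 
each stage of the pipeline separately.  For scale, ~300k documents in 24 
hours works out to roughly 0.29 seconds per document (86,400 s / 300,000).  
The sketch below is a minimal, hypothetical helper: StageTimer, fakeParse 
and fakeIndex are invented names, not Lucene APIs, and the stand-in stage 
bodies would be replaced by the real XML parsing and IndexWriter calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper: accumulates wall-clock time per named stage. */
final class StageTimer {
    private final Map<String, Long> nanos = new LinkedHashMap<>();

    /** Runs the work, adds its elapsed time to the stage total, returns its result. */
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanos.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total nanoseconds recorded for a stage, or 0 if it never ran. */
    long totalNanos(String stage) {
        return nanos.getOrDefault(stage, 0L);
    }

    /** One line per stage: name, milliseconds, share of the measured total. */
    String report() {
        long total = 0;
        for (long n : nanos.values()) {
            total += n;
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : nanos.entrySet()) {
            double pct = total == 0 ? 0.0 : 100.0 * e.getValue() / total;
            sb.append(String.format("%-8s %10.1f ms  %5.1f%%%n",
                    e.getKey(), e.getValue() / 1e6, pct));
        }
        return sb.toString();
    }

    // Stand-ins (assumptions) for the real work: XML parsing and adding to the index.
    static String fakeParse(String raw) {
        return raw.replaceAll("<[^>]+>", "");
    }

    static int fakeIndex(String text) {
        return text.length();
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        for (int i = 0; i < 1000; i++) {
            String raw = "<doc>" + i + "</doc>";
            String parsed = timer.time("parse", () -> fakeParse(raw));
            timer.time("index", () -> fakeIndex(parsed));
        }
        System.out.print(timer.report());
    }
}
```

If the "parse" stage dominates the report, changing index parameters is 
unlikely to help much.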

In general, Lucene's default settings are designed to give good 
performance.  If pumping up some parameter made a huge performance 
improvement with little other impact then it would be pumped up by 
default.  Increasing the mergeFactor speeds things somewhat, but it also 
causes more file handles to be used.
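
The trade-off can be illustrated with a toy model.  It assumes each added 
document starts as a one-document segment, that every mergeFactor 
equal-sized segments are merged into one, and that a single optimize runs 
at the end.  This is a simplification for illustration, not Lucene's 
actual merge code; MergeModel and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of logarithmic segment merging (an assumption, not Lucene source). */
final class MergeModel {

    /** Outcome of one simulated indexing run. */
    static final class Result {
        final long docsCopied;   // documents rewritten by all merges, including optimize
        final int peakSegments;  // largest number of segments alive at any moment

        Result(long docsCopied, int peakSegments) {
            this.docsCopied = docsCopied;
            this.peakSegments = peakSegments;
        }
    }

    static Result simulate(int numDocs, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        List<Long> segments = new ArrayList<>(); // segment sizes, oldest first
        long copied = 0;
        int peak = 0;
        for (int i = 0; i < numDocs; i++) {
            segments.add(1L); // each new document begins as its own segment
            peak = Math.max(peak, segments.size());
            // Cascade: merge while the newest mergeFactor segments are the same size.
            while (tailIsUniform(segments, mergeFactor)) {
                long merged = 0;
                for (int k = 0; k < mergeFactor; k++) {
                    merged += segments.remove(segments.size() - 1);
                }
                copied += merged;
                segments.add(merged);
            }
        }
        if (segments.size() > 1) { // final optimize: rewrite everything into one segment
            for (long size : segments) {
                copied += size;
            }
        }
        return new Result(copied, peak);
    }

    private static boolean tailIsUniform(List<Long> segments, int mergeFactor) {
        int n = segments.size();
        if (n < mergeFactor) {
            return false;
        }
        long size = segments.get(n - 1);
        for (int k = 2; k <= mergeFactor; k++) {
            if (segments.get(n - k) != size) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (int mf : new int[] {10, 50}) {
            Result r = simulate(1000, mf);
            System.out.println("mergeFactor=" + mf + " docsCopied=" + r.docsCopied
                    + " peakSegments=" + r.peakSegments);
        }
    }
}
```

In this model, 1000 documents with mergeFactor 10 rewrite 3000 documents 
and peak at 28 segments, while mergeFactor 50 rewrites 2000 and peaks at 
69 segments.  Each live segment is stored as files on disk, which is 
where the extra file handles come from.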

When Karl talks of "flushing" a RAM-based index to disk, I suspect he's 
using IndexWriter.addIndexes().  Reading his message, I'd be surprised 
if his performance is really much better than it would be if he just set 
mergeFactor to 50 and then optimized the index just once at the end, and 
that is a lot less work.


Michael Barry wrote:
> Thanks for all the info. I've been working on streamlining my indexing 
> and I've finally
> found the message from last year that intrigued me
> (

> In that message, Karl Øie suggests:
> 1. use a ramdir and multiple fsdirs
> 2. merge the fsdirs into a single fsdir
> 3. use threads
> (Of course he provides more details.)
> I have a question concerning RAMDirectories - is there any benefit using 
> them over setting the
> mergeFactor higher? Also, I notice a lot of advice to use 
> RAMDirectories but not much verbiage on
> how to use them effectively.
> In the above msg from Karl, he suggests writing to a RAMDirectory and 
> then at
> some point flushing the RAMDirectory to an FSDirectory. Anyone have any 
> code to illuminate
> that? It's the "flushing" part that's getting me. Is flushing just 
> calling list() on the
> RAMDirectory and then deleteFile() each one? Originally I was just 
> creating a new
> RAMDirectory each time I needed one (not the best approach but it does 
> work).
> I know I should spend time profiling the code to see exactly where the 
> bottlenecks
> occur, and I will do that, but I'd also like to get a good handle on the 
> multiple ways to
> index.
> Thanks for your time, Mike.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
