lucene-java-user mailing list archives

From "Michael D. Curtin" <>
Subject Re: Speedup indexing process
Date Fri, 17 Feb 2006 15:39:04 GMT
Java Programmer wrote:

> Hi,
> Maybe this question is trivial, but I need to ask it. I have a problem with
> indexing a large number of documents, and I'm looking for a better solution.
> The task is to index about 33GB of CSV text data (each record about 30kB). It
> is possible, of course, to index these data, but I'm not happy with the timing
> (about 26 hours), so I want to know how I can speed up this process. First I
> thought about splitting the CSV file into smaller ones, e.g. 5GB each, and
> indexing them on 6 indexing computers. But now my question is: can I join such
> parts into one index after the indexing job on each computer is finished? I
> saw an example with a RAMDirectory being merged into an FSDirectory, but that
> example used the same IndexWriter; in my case I need separate IndexWriters on
> a few computers. So is this possible with Lucene?

Here are some things you can try.  First, look at IndexWriter.mergeFactor and 
IndexWriter.minMergeDocs.  These two attributes control how often IndexWriter 
actually writes a batch of indexed documents to disk (and therefore how big 
each on-disk segment is), and how many segments get merged together at a time. 
Since each merge is essentially a big read and re-write of already-indexed 
documents, the fewer times you do it, the shorter your indexing time.  On the 
other hand, merging less often takes more RAM.  In other words, it's another 
incarnation of the classic tradeoff between space and time.  As one data 
point, one of my applications has documents about 10% the size of yours 
(about 3K each).  Changing minMergeDocs to 70000 and mergeFactor to 70 cut 
its indexing time by more than half.
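
As a minimal sketch, here's what that tuning looks like with the current API, 
where these are public fields on IndexWriter.  The path is a placeholder, the 
analyzer is whichever one you use, and the values are just the ones from my 
data point, so tune them to your RAM budget:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TunedIndexer {
    public static void main(String[] args) throws IOException {
        // "/path/to/index" is a placeholder; "true" creates a fresh index.
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        writer.mergeFactor = 70;     // segments merged together at a time (default 10)
        writer.minMergeDocs = 70000; // docs buffered in RAM before flushing a segment (default 10)

        // ... loop over your CSV records, calling writer.addDocument(doc) for each ...

        writer.optimize(); // final merge down to one segment, for faster searching
        writer.close();
    }
}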

Another approach is along the lines of what you mentioned.  Index subsets of 
the data on several machines, then merge them all together at the end (that 
part has to be done on one machine).  See IndexWriter.addIndexes().
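
A rough sketch of that merge step (the paths here are hypothetical, standing 
in for the part indexes copied over from the other machines):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexMerger {
    public static void main(String[] args) throws IOException {
        // Destination index; "true" creates it fresh.
        IndexWriter writer = new IndexWriter("/path/to/merged-index", new StandardAnalyzer(), true);

        // The part indexes built on the other machines ("false" = don't create).
        Directory[] parts = new Directory[] {
            FSDirectory.getDirectory("/path/to/part1", false),
            FSDirectory.getDirectory("/path/to/part2", false),
            FSDirectory.getDirectory("/path/to/part3", false),
        };

        // Merges all segments from the part indexes into this one; in the
        // versions I've used, it optimizes as part of the merge.
        writer.addIndexes(parts);
        writer.close();
    }
}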

Of course, it's also possible that parsing the documents themselves is a big 
chunk of your time.  If you're using your own Analyzer, or your data is 
unusual in some way, you might look at that too.
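
One quick way to check is to run the Analyzer by itself over a sample record 
and time it, which separates analysis cost from disk I/O.  A sketch (the 
field name and sample text are made up):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerTimer {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(); // or your own Analyzer
        String sample = "...one 30kB CSV record's text here...";

        long start = System.currentTimeMillis();
        for (int i = 0; i < 1000; i++) {
            TokenStream stream = analyzer.tokenStream("contents", new StringReader(sample));
            while (stream.next() != null) {
                // consume tokens; we only care how long this loop takes
            }
            stream.close();
        }
        System.out.println("analysis took " + (System.currentTimeMillis() - start) + " ms");
    }
}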

Note that these approaches are not mutually exclusive, i.e. you can combine 
them.  Good luck!

