lucene-java-user mailing list archives

From "Armbrust, Daniel C." <Armbrust.Dan...@mayo.edu>
Subject RE: Indexing very large sets (10 million docs)
Date Mon, 28 Jul 2003 16:16:20 GMT
We are currently doing something similar here.

We have upwards of 15 million documents in our index.

There has been a lot of discussion on this in the past... But I'll give a few details:

My current technique for indexing very large amounts of data is to:

Set the merge factor to 90
Leave maxMergeDocs alone (lucene default)
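
In code, those two settings look roughly like this (a sketch against the Lucene 1.x API of the
time, where mergeFactor was a public field on IndexWriter; the path and analyzer are just
placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // Open a fresh index and tune it for bulk loading.
    IndexWriter writer = new IndexWriter("/indexes/part-0", new StandardAnalyzer(), true);
    writer.mergeFactor = 90;   // merge on-disk segments less often during the bulk load
    // maxMergeDocs is simply left at the Lucene default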

Index 100,000 documents into an index - no optimize, just index them in.  (The right batch
size depends on the file size and the number of fields in your documents; with a lot of fields
you will run out of file handles sooner, I believe.)  Close that index, open a new one, write
the next 100,000 docs into it, and continue.  If you have more than one machine or more than
one disk drive, running these processes in parallel helps a lot.  A rough sketch of the batch
loop is below.
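
Here is a minimal sketch of that batch loop, again against the Lucene 1.x API; the directory
layout, field name, batch count, and nextDocumentText() data source are all hypothetical
stand-ins:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            int batchSize = 100000;   // docs per sub-index
            int totalBatches = 150;   // ~15 million docs total
            for (int batch = 0; batch < totalBatches; batch++) {
                // Each batch goes into its own brand-new index directory.
                IndexWriter writer = new IndexWriter("/indexes/part-" + batch,
                                                     new StandardAnalyzer(), true);
                writer.mergeFactor = 90;
                for (int i = 0; i < batchSize; i++) {
                    Document doc = new Document();
                    doc.add(Field.Text("contents", nextDocumentText()));
                    writer.addDocument(doc);
                }
                writer.close();       // note: no optimize() call anywhere
            }
        }

        // Stand-in for however you actually fetch your source documents.
        static String nextDocumentText() { return "..."; }
    }

Running the batches as separate processes, each writing to its own directory (or its own
disk), is what makes the parallelism easy - the sub-indexes never touch each other until the
final merge.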

After you have a whole bunch of 100,000-document indexes, merge them all into one.  You never
have to call optimize - the result is optimized as part of the merge.
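
A sketch of that final merge step, using IndexWriter.addIndexes(Directory[]) from the Lucene
API of the day (the paths and batch count are placeholders matching the loop above):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class IndexMerger {
        public static void main(String[] args) throws Exception {
            int totalBatches = 150;
            Directory[] parts = new Directory[totalBatches];
            for (int i = 0; i < parts.length; i++) {
                parts[i] = FSDirectory.getDirectory("/indexes/part-" + i, false);
            }
            // Merge every sub-index into one final index; no optimize() needed.
            IndexWriter merged = new IndexWriter("/indexes/final",
                                                 new StandardAnalyzer(), true);
            merged.addIndexes(parts); // the merged index comes out optimized
            merged.close();
        }
    }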

I have written an entire wrapper around Lucene which handles all of this for me... though I
did it at work, and I don't know if I can release the code.  I think a couple of other people
have done similar things and have released their code.

Dan

