lucene-java-user mailing list archives

From Jason Rutherglen <>
Subject Re: Best strategy for reindexing large amount of data
Date Wed, 07 Oct 2009 18:35:32 GMT

Depending on the hardware available, you can use a Hadoop cluster
to reindex more quickly. With Amazon EC2 one can spin up several
nodes, reindex, then tear them down when they're no longer
needed. You can also simply update the existing documents in the
index in place, though you'd need to be careful not to overload
the server with indexing calls to the point that queries become
unresponsive. Your option 3 (batches) could be used to create an
index on the side (like a Solr master), record the deletes in a
file, then merge the newly created index in, apply the deletes,
and commit to make the changes visible.
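A rough sketch of that side-index merge, under some assumptions: it uses the post-3.x IndexWriter API rather than the 2.9 API current at the time of this thread, and the field names ("guid", "body") and the batch plumbing are made up for illustration. Note that the guid-based deletes are applied to the main index before addIndexes(), so they cannot remove the freshly merged documents.

```java
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

public class SideIndexMerge {

    /** Reindex one batch into a side index, then fold it into the main index. */
    public static void mergeBatch(Directory mainDir, Directory sideDir,
                                  List<String> guids, List<String> bodies) throws Exception {
        // 1. Build the side index from the freshly converted batch.
        try (IndexWriter side = new IndexWriter(sideDir,
                new IndexWriterConfig(new StandardAnalyzer()))) {
            for (int i = 0; i < guids.size(); i++) {
                Document doc = new Document();
                doc.add(new StringField("guid", guids.get(i), Field.Store.YES));
                doc.add(new TextField("body", bodies.get(i), Field.Store.NO));
                side.addDocument(doc);
            }
        }
        // 2. Delete the stale copies from the main index first, then merge the
        //    side index in and commit, so searchers see both changes at once.
        try (IndexWriter main = new IndexWriter(mainDir,
                new IndexWriterConfig(new StandardAnalyzer()))) {
            for (String guid : guids) {
                main.deleteDocuments(new Term("guid", guid));
            }
            main.addIndexes(sideDir);
            main.commit();
        }
    }
}
```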

There are advantages and disadvantages to each strategy.
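The update-in-place approach (Maarten's option 1) is the simplest to sketch. IndexWriter.updateDocument() atomically deletes any existing document matching the given term and adds the replacement, so reindexing by guid can be throttled to whatever rate the server tolerates alongside queries. The field names here are assumptions, and the API again follows post-3.x Lucene:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateInPlace {

    /** Replace (or insert) the index entry for one message, keyed on its guid. */
    public static void reindexOne(IndexWriter writer, String guid, String body)
            throws Exception {
        Document doc = new Document();
        doc.add(new StringField("guid", guid, Field.Store.YES));
        doc.add(new TextField("body", body, Field.Store.NO));
        // Deletes any document whose "guid" term matches, then adds doc,
        // as a single atomic operation.
        writer.updateDocument(new Term("guid", guid), doc);
    }
}
```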


On Wed, Oct 7, 2009 at 11:15 AM, Maarten_D <> wrote:
> Hi,
> I've searched the mailinglists and documentation for a clear answer to the
> following question, but haven't found one, so here goes:
> We use Lucene to index and search a constant stream of messages: our index
> is always growing. In the past, if we added new features to the software
> that required the index to be rebuilt (adopting an accent-insensitive
> analyzer for instance, or adding a field to every Lucene Document), we would
> build an entirely new index out of all the messages we had stored, and then
> swap out the old one with the new one. Recently, we've had a couple of
> clients whose message stores are so large that our strategy is no longer
> viable: building a new index from scratch takes, for various reasons not
> related to lucene, upwards of 48 hours, and that period will only increase
> when client message stores grow bigger and bigger.
> What I would like is to update the index piecemeal, starting with the most
> recently added documents (i.e. the most recent messages, since clients usually
> care about those the most). Then, most of the users will see the new
> functionality in their searches fairly quickly, and the older stuff, which
> doesn't matter so much, will get reindexed at a later date. However, I'm
> unclear as to what would be the best/most performant way to accomplish this.
> There are a few strategies I've thought of, and I was wondering if anyone
> could help me out as to which would be the best idea (or if there are other,
> better methods that I haven't thought of). I should also say that every
> message in the system has a unique identifier (guid) that can be used to see
> whether two different lucene documents represent the same message.
> 1. Simply iterate over all messages in the message store, convert them to
> Lucene documents, and call IndexWriter.updateDocument() for each one (using
> the guid).
> 2. Iterate over all messages in small steps (say 1000 at a time), and then
> for each batch delete the existing documents from the index and call
> IndexWriter.addDocument() for all messages (this is essentially option 1,
> split up into small parts with the deletes and inserts batched).
> 3. Iterate over all messages in small steps, and for each batch create a
> separate index (let's say a RAM index), delete all the old documents from the
> main index, and merge the separate index into the main one.
> 4. Same as 3, except merge first, and then remove the old duplicates.
> Any help on this issue would be much appreciated.
> Thanks in advance,
> Maarten
> --
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
