lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: How can we know if 2 lucene indexes are same?
Date Fri, 05 Sep 2008 12:33:11 GMT

Shalin Shekhar Mangar wrote:

> Let me try to explain.
> I have a master where indexing is done. I have multiple slaves for  
> querying.
> If I commit+optimize on the master and then rsync the index, the data
> transferred on the network is huge. An alternate way is to commit on  
> master,
> transfer the delta to the slave and issue an optimize on the slave.  
> This is
> very fast because less data is transferred on the network.

Large segment merges will also send huge traffic.  You may just want  
to send all updates (document adds/deletes) to all slaves directly?   
It'd be nice if you could somehow NOT sync the effects of segment  
merging, but do sync doc add/deletes... not sure how to do that.

> However, we need to ensure that the index on the slave is actually  
> in sync
> with the master. So that on another commit, we can blindly transfer  
> the
> delta to the slave.

I assume your app ensures that no deltas arrive to the slave while  
it's running optimize?  So then the question becomes (I think) "if two  
indices are identical to begin with, and I separately run optimize on  
each, will the resulting two optimized indices be identical?".

By "in sync" you also require the final segment name (after optimize)  
is identical right?

I think the answer is yes, but I'm not certain unless I think more  
about it.  Also this behavior is not "promised" in Lucene's API.

Merges for optimize are now allowed to run concurrently (by default,  
with ConcurrentMergeScheduler), except for the final (< mergeFactor  
segments) merge, which waits until others have finished.  So if there  
are 7 obvious merges necessary to optimize, 3 will run concurrently,  
while 4 wait.  Those 4 then run as the merges finish over time, which  
may happen in different orders for each index and so different merges  
may run.  Then the final merge will run and I *think* the net number  
of merges that ran should always be the same and so the final segment  
name should be the same.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message