lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Per Steffensen <>
Subject Re: Avoid losing data on ZK connection-loss/session-timeout
Date Tue, 21 Aug 2012 14:17:05 GMT
Per Steffensen wrote:
> *Mark Miller:*
> *Ad 3)* Well, we can do some practical things right? I don't think we 
> need to support a node coming back from the dead a year later and it 
> had some updates the cluster doesn't have. A node coming up 2 minutes 
> later is something we want to worry about though.
A year, no, but 2 minutes is way to low a limit.

There can be many reasons for running with replica - e.g.
a) High Availability wrt updating docs and wrt search
b) Handling of higher search volumen
c) Have a "live" backup so that you dont (as easily) lose data

Dont know the Solr design-criteria behind the new "4.0 kind of" 
replication, but if there is a c)-criteria hidden there somewhere, 2 
minutes is not enough.

A valid scenario is that you run with 1 replica (2 shards per slice) and 
expect not to lose data as long as no more than one disk is crashed (in 
overlapping periods). So lets say the disk on the machine running the 
leader of this slice crashes. Allowing a "behind" replica to continue as 
leader shortly after, and therefore opening up for new updates to the 
slice, data only on the old leader will be lost forever. Depending on 
preferences between "accepting data loss" and "accepting down time 
(where new updates cannot take place)" (basically between a) and c) 
above) an admin of such a system might expect to have a fair chance of 
making the disk work again or at least dig the data out of it and put it 
onto a new machine and configure that to participate in the Solr 
cluster. In such a case 2 minutes is way to little.

Another valid scenario. Same setup a first scenario, but this time the 
Solr JVM running the leader just crashes or the motherboard/CPU burns or 
something. You are now in a posistion where you still have the newest 
data, it is just not "online". Again an admin, depending on preferences, 
might want a "behind" replica not to take over leadership and allow 
updates. As soon as you allow updates to the new leader (old replica) 
and there where data on the old leader that was not yet replicated to 
the replica, you are dead - you havent necessarily lost data but you 
have put yourself in a position where you cannot ever reconstruct 
correct dataset (and thats basically the same as losing data).
> So basically we either need something timing based or admin command 
> based that lets you start a cold shard (slice :-) ) and each node 
> waits around for X amount of time or until command X is received, and 
> then leader election begins.
I like this approach, where you, depending on preferences, can setup the 
system so that a replica is allowed to take over leadership after X 
minutes even though it knows that it is "behind" (a replica is always 
allowed to take over leadership immidiately if it knows it is not 
"behind"), but where you can also setup up your system so that it 
requires an admin "acceptance" for this to happen.

Some systems (potentially including the one Im currently working on :-) 
) might not prefere HA over "not losing data".

I think the following should be done
- Increase the likelihood of replica not being behind. With the current 
implementation, in case of a sudden crash, the likelihood of a replica 
being behind is way too big. Some kind of atomicity between leader and 
replica writing to transaction-log or at least "committing" the changes 
to the transaction-log is needed
- A common knowledge among shards in the same slice about "current 
newest version of slice" would be very beneficial. E.g. leader writes 
"newest version" to ZK every time he writes to (or commits to) 
transaction-log. The writing/committing to leader transaction-log and 
writing to ZK also needs to be as atomic as possible.
- With the two steps above
-- a replica will be able to know if it is behind and therefore if it 
should wait (a period of time or for an admin "acceptance") before 
taking over leadership
-- and the chance that such a situation where a replica is actually 
behind has been minimized.
> *Jan H√łydahl:*
> ElasticSearch has some settings to control when recovery starts after 
> cluster restart, see Guide. This approach looks reasonable. If we know 
> that we expect N nodes in our cluster we can start recovery when we 
> see N nodes up. If fewer than N nodes up, we wait for X time (running 
> on local data, not accepting new updates) before recovery and leader 
> election starts.
As you might know we used to use ElasticSearch in my current project. 
For political reasons we decided to stop using ES and move to Solr. I 
was very much against that decision, not because of Solr (I didnt know 
much about it at that point in time), but because ES was actually very 
cool and functioning very well (for a "before v1.0  piece of software"). 
But that particular feature/approach that Jan mentions was not one of 
the "cool things" about ES (along with its automatic shard re-location - 
uhhhh still having nightmares).

View raw message