zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: zookeeper on ec2
Date Tue, 07 Jul 2009 16:25:14 GMT
Henry Robinson wrote:
> Effectively, EC2 does not introduce any new failure modes but potentially
> exacerbates some existing ones. If a majority of EC2 nodes fail (in the
> sense that their hard drive images cannot be recovered), there is no way to
> restart the cluster, and persistence is lost. As you say, this is highly
> unlikely. If, for some reason, the quorums are set such that only a single
> node failure could bring down the quorum (bad design, but plausible), this
> failure is more likely.

This is not strictly true. The cluster cannot recover _automatically_ if 
failures > n, where the ensemble size is 2n+1. However, you can recover 
manually as long as at least one snapshot and its trailing logs can be 
recovered. We can even recover if the latest snapshots are corrupted, as 
long as we can recover a snapshot from some previous time t and all logs 
subsequent to t.
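The arithmetic behind this can be sketched as follows (an illustration 
only, not ZooKeeper code): for an ensemble of 2n+1 servers, a majority 
quorum is n+1, so up to n simultaneous failures are survivable without 
manual intervention.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1

def max_automatic_failures(ensemble_size: int) -> int:
    """Failures tolerated without manual recovery: n for a 2n+1 ensemble."""
    return ensemble_size - quorum_size(ensemble_size)

# A 5-node ensemble (n = 2) needs 3 servers for a quorum and rides out
# 2 failures; a 3rd failure requires the manual snapshot+log recovery
# described above.
for size in (3, 5, 7):
    print(size, quorum_size(size), max_automatic_failures(size))
```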

> EC2 just ups the stakes - crash failures are now potentially more dangerous
> (bugs, packet corruption, rack local hardware failures etc all could cause
> crash failures). It is common to assume that, notwithstanding a significant
> physical event that wipes a number of hard drives, writes that are written
> stay written. This assumption is sometimes false given certain choices of
> filesystem. EC2 just gives us a few more ways for that not to be true.
> I think it's more possible than one might expect to have a lagging minority
> left behind - say they are partitioned from the majority by a malfunctioning
> switch. They might all be lagging already as a result. Care must be taken
> not to bring up another follower on the minority side to make it a majority,
> else there are split-brain issues as well as the possibility of lost
> transactions. Again, not *too* likely to happen in the wild, but these
> permanently running services have a nasty habit of exploring the edge
> cases...
>> To be explicit, you can cause any ZK cluster to back-track in time by doing
>> the following:
> ...
>> f) add new members of the cluster
> Which is why care needs to be taken that the ensemble can't be expanded
> without a current quorum. Dynamic membership doesn't save us when a majority
> fails - the existence of a quorum is a liveness condition for ZK. To help
> with the liveness issue we can sacrifice a little safety (see, e.g., vector
> clock ordered timestamps in Dynamo), but I think that ZK is aimed at safety
> first, liveness second. Not that you were advocating changing that; I'm just
> articulating why correctness is extremely important from my perspective.
> Henry
>> At this point, you will have lost the transactions from (b), but I really,
>> really am not going to worry about this happening either by plan or by
>> accident.  Without steps (e) and (f), the cluster will tell you that it
>> knows something is wrong and that it cannot elect a leader.  If you don't
>> have *exact* coincidence of the survivor set and the set of laggards, then
>> you won't have any data loss at all.
>> You have to decide if this is too much risk for you.  My feeling is that it
>> is an OK level of correctness for conventional weapon fire control, but not
>> for nuclear weapons safeguards.  Since my apps are considerably less
>> sensitive than either of those, I am not much worried.
>> On Mon, Jul 6, 2009 at 12:40 PM, Henry Robinson <henry@cloudera.com>
>> wrote:
>>> It seems like there is a
>>> correctness issue: if a majority of servers fail, with the remaining
>>> minority lagging the leader for some reason, won't the ensemble's current
>>> state be forever lost?
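The point above about needing *exact* coincidence of the survivor set and 
the set of laggards can be sketched as a predicate (an illustration only, 
with hypothetical server names): committed transactions are lost only when 
every surviving server was among the laggards.

```python
def committed_data_lost(survivors: frozenset, laggards: frozenset) -> bool:
    """Loss of committed data requires every survivor to be a laggard."""
    return len(survivors) > 0 and survivors <= laggards

# 5-node ensemble; servers d and e lag the leader, then a, b, c are wiped.
assert committed_data_lost(frozenset("de"), frozenset("de"))

# If even one up-to-date server (c) survives, no committed data is lost:
# c's log lets the recovered ensemble catch up.
assert not committed_data_lost(frozenset("cde"), frozenset("de"))
```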
