zookeeper-user mailing list archives

From Henry Robinson <he...@cloudera.com>
Subject Re: zookeeper on ec2
Date Mon, 06 Jul 2009 23:50:41 GMT
On Mon, Jul 6, 2009 at 10:16 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> No.  This should not cause data loss.

> As soon as ZK cannot replicate changes to a majority of machines, it
> refuses to take any more changes.  This is old ground and is required
> for correctness in the face of network partition.  It is conceivable
> (barely) that *exactly* the minority that were behind were the
> survivors, but this is almost equivalent to a complete failure of the
> cluster choreographed in such a way that a few nodes come back from
> the dead just afterwards.  That could cause some "completed"
> transactions to disappear from the recovered state, but at this level
> of massive failure, we have the same issues with any cluster.

Effectively, EC2 does not introduce any new failure modes but potentially
exacerbates some existing ones. If a majority of EC2 nodes fail (in the
sense that their hard drive images cannot be recovered), there is no way to
restart the cluster, and its persistent state is lost. As you say, this is
highly unlikely. If, for some reason, the ensemble is configured so that a
single node failure is enough to destroy the quorum (bad design, but
plausible), this failure becomes more likely.
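To make the majority arithmetic concrete, here is a minimal sketch (Python;
the function names are mine, not ZooKeeper's):

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that constitutes a strict majority."""
    return ensemble_size // 2 + 1

def can_make_progress(live_servers, ensemble_size):
    """ZK accepts writes only while a majority of the ensemble is reachable."""
    return live_servers >= quorum_size(ensemble_size)

# A 5-node ensemble needs 3 live servers and so tolerates 2 failures;
# with only 2 survivors it refuses writes rather than risk divergence.
print(quorum_size(5))              # 3
print(can_make_progress(3, 5))     # True
print(can_make_progress(2, 5))     # False
```

Note that an even-sized ensemble buys nothing here: 6 nodes also need 4
survivors, so they tolerate the same 2 failures as 5 nodes while adding a
machine that can fail.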

EC2 just ups the stakes - crash failures are now potentially more dangerous
(bugs, packet corruption, rack-local hardware failures and so on can all
cause crash failures). It is common to assume that, barring a significant
physical event that wipes a number of hard drives, writes that are written
stay written. This assumption is sometimes false given certain choices of
filesystem; EC2 just gives us a few more ways for it to be false.

I think it's more likely than one might expect to have a lagging minority
left behind - say they are partitioned from the majority by a
malfunctioning switch, which may be why they were lagging in the first
place. Care must be taken not to bring up another follower on the minority
side and turn it into a majority, else there are split-brain issues as well
as the possibility of lost transactions. Again, not *too* likely to happen
in the wild, but these permanently running services have a nasty habit of
exploring the edge cases.
> To be explicit, you can cause any ZK cluster to back-track in time by doing
> the following:

> f) add new members of the cluster

Which is why care needs to be taken that the ensemble can't be expanded
without the agreement of a current quorum. Dynamic membership doesn't save
us when a majority fails - the existence of a quorum is a liveness
condition for ZK. To help with the liveness issue we can sacrifice a little
safety (see, e.g., the vector-clock-ordered timestamps in Dynamo), but I
think ZK is aimed at safety first, liveness second. Not that you were
advocating changing that - I'm just articulating why correctness is
extremely important from my perspective.
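For contrast with ZK's quorum-based approach, here is a minimal sketch of
the Dynamo-style trade: vector clocks let every replica keep accepting
writes during a partition (liveness), at the cost of surfacing conflicting
versions to the reader (safety). This is an illustrative toy, not Dynamo's
actual implementation:

```python
def compare(vc_a, vc_b):
    """Compare two vector clocks (dicts mapping node id -> counter).

    Returns "before", "after", "equal", or "concurrent"."""
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

# Two replicas accept writes independently during a partition:
v1 = {"nodeA": 2, "nodeB": 1}
v2 = {"nodeA": 1, "nodeB": 2}
print(compare(v1, v2))   # "concurrent" - the conflict is exposed to
                         # the reader instead of blocking the writers
```

ZK makes the opposite choice: rather than ever return two concurrent
versions, it refuses writes when no quorum exists.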


> At this point, you will have lost the transactions from (b), but I
> really, really am not going to worry about this happening either by
> plan or by accident.  Without steps (e) and (f), the cluster will tell
> you that it knows something is wrong and that it cannot elect a
> leader.  If you don't have *exact* coincidence of the survivor set and
> the set of laggards, then you won't have any data loss at all.
>
> You have to decide if this is too much risk for you.  My feeling is
> that it is an OK level of correctness for conventional weapon fire
> control, but not for nuclear weapons safeguards.  Since my apps are
> considerably less sensitive than either of those, I am not much
> worried.

> On Mon, Jul 6, 2009 at 12:40 PM, Henry Robinson <henry@cloudera.com>
> wrote:
> > It seems like there is a
> > correctness issue: if a majority of servers fail, with the remaining
> > minority lagging the leader for some reason, won't the ensemble's current
> > state be forever lost?
> >
