zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Flavio Junqueira <...@yahoo-inc.com>
Subject Re: question on ZAB protocol
Date Wed, 13 Jul 2011 12:50:56 GMT
Hi Yang, Some answers between lines:

On Jul 13, 2011, at 3:45 AM, Yang wrote:

> the leader broadcasts a write to all replicas, and then waits for a
> quorum to reply, before sending out the COMMIT.
> why is the quorum necessary (i.e. why can't the leader just wait for
> one reply and start sending the COMMIT?)??
> now that I think about it, it seems that waiting for just one reply is
> enough, because the connection from leader to replicas are FIFO, as
> long as the replicas do not die,
> they will eventually get the writes, even though the writes arrive at
> them after the leader starts the COMMIT.

Just to clarify, FIFO guarantees that a prefix of what has been sent  
is received and that the messages follow the send order. Delivery is  
not guaranteed, though. If we deliver no message, FIFO is still  

> the only reason I can think of  for using a quorum is to tolerate more
> failures: if the only replied replica  dies, and leader dies, then we
> lose that  latest write.

Committing a transaction based on a single acknowledgement does not  
guarantee the durability of the operation in general if you want to  
tolerate at least one crash. Once a client receives a confirmation of  
an operation, enough ZooKeeper replicas must have persisted the  
corresponding update. Enough in this case implies that any subset of  
replicas used to recover upon a leader crash must have at least one  
server containing that update.

Now, I think that your optimization works only in the special case  
that the client that issued the operation is connected to a follower  
and we have 3 replicas, assuming majority quorums. If the client is  
connected to the leader directly, then it doesn't work. It also  
doesn't work if we need to tolerate at least 2 crashes, independent of  
the status of the replica the client is connected to.

> by requiring f ACKs, you can tolerate f-1 failures. but then you don't
> really need 2f+1 nodes in the ZK cluster, just f+1 is enough.

This derivation might be obvious, but let me put it down for  
completeness at least. Say that we use at least q servers to recover  
upon a leader crash. To ensure progress, we need at least "n >= q + f"  
replicas, where f is a threshold on the number of simultaneous crashes  
you're willing to tolerate. If "q <= f", then we can't guarantee that  
transactions committed during a recovery step are durable, since  
another subsequent recovery step might not have any replica in common  
with the previous one. Consequently, we have that q is at least f+1,  
which gives us "n >= 2f + 1"

Does it make sense?


View raw message