zookeeper-user mailing list archives

From Mahadev Konar <maha...@yahoo-inc.com>
Subject Re: question about ZK robustness
Date Wed, 01 Dec 2010 04:07:38 GMT
I meant to say: we can wait a while, until we are done moving to the new wiki tree.


On 11/30/10 7:24 PM, "Mahadev Konar" <mahadev@yahoo-inc.com> wrote:

Hi Ted,
  Good summary. I think this should definitely go onto the wiki. Do you want to take a first stab
at creating a wiki page? I'd suggest waiting a while, so that


On 11/20/10 2:28 PM, "Ted Dunning" <ted.dunning@gmail.com> wrote:

I was just asked a very cogent question of the form "how do you know" and
would like somebody who knows better than I do to confirm or deny my
response.  The only part that I am absolutely sure of is the part at the end
where I say "No doubt I have omitted something".  With an edit from Ben,
this probably should become a wiki page.

Here is the conversation.  Please mark my errors or elisions if you can.

> You have to educate me in how ZK does data-integrity checking to avoid
> propagating accidental data corruption (e.g., from an ext3 bug, a faulty drive,
> etc.). We might have to augment ZK to add that ability if it doesn't already.

There are several mechanisms at the application API layer, in ZK internals
and in the snapshot and transaction log formats.

At the application layer, all updates are atomic and each update replaces the
entire contents of a znode.  If you provide a version number with the update
instead of -1, then your update will only succeed if the znode's current
version matches the one you supplied.
All updates are strictly serialized so there are no race conditions on
updates.  This makes lots of things simpler inside of ZK.

All updates go through the ZK master and are only committed when replicated
to a quorum of the cluster.  The quorum is ceil((n+1)/2) where n is the
configured number of servers.  Committed means that the update has been
flushed to the transaction log on disk.  Replication streams from master
memory to slave memory; it is not serialized through the master's disk first.
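The quorum arithmetic is easy to check (a quick sketch; ceil((n+1)/2) is the same majority as floor(n/2)+1):

```python
import math

def quorum(n):
    # Majority quorum for n configured servers: ceil((n + 1) / 2),
    # equivalently floor(n / 2) + 1.
    return math.ceil((n + 1) / 2)

# A 4-server ensemble needs the same quorum as 5 servers but
# tolerates one fewer failure, which is why odd sizes are preferred.
for n, q in [(1, 1), (3, 2), (4, 3), (5, 3), (7, 4)]:
    assert quorum(n) == q
```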

All logs and snapshots carry application-level CRCs and don't depend on disk
ECC for correctness.
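A minimal sketch of record-level checksumming of this kind (a toy format, not ZooKeeper's actual on-disk layout; in the implementation the transaction log uses Adler-32 checksums per record):

```python
import struct
import zlib

def write_record(buf, payload):
    # Store checksum + length ahead of each record so a torn or
    # bit-flipped record is detected on read, independent of the disk.
    crc = zlib.adler32(payload)
    return buf + struct.pack(">II", crc, len(payload)) + payload

def read_records(buf):
    out, off = [], 0
    while off + 8 <= len(buf):
        crc, length = struct.unpack_from(">II", buf, off)
        payload = buf[off + 8 : off + 8 + length]
        if len(payload) < length or zlib.adler32(payload) != crc:
            break  # truncated or corrupt: keep only the good prefix
        out.append(payload)
        off += 8 + length
    return out

log = write_record(b"", b"txn-1")
log = write_record(log, b"txn-2")
assert read_records(log) == [b"txn-1", b"txn-2"]
# Flip a bit in the last record: only the intact prefix survives.
corrupt = bytearray(log)
corrupt[-1] ^= 0x01
assert read_records(bytes(corrupt)) == [b"txn-1"]
```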

At cluster start, logs are examined to determine the latest transaction that
was committed correctly.  At least a quorum must have the latest update
because of the update semantics, and the strong CRCs prevent a partially
written transaction from being read as acceptable.  I don't know exactly how
the last update id is stored.
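A toy sketch of why a quorum suffices at recovery (hypothetical helper name; real ZAB leader election also compares epochs and is considerably more involved):

```python
def recovered_zxid(responding_zxids, n):
    """Among any quorum of servers, the highest last-logged zxid is at
    least the last committed transaction, because any two quorums of the
    same ensemble must intersect.  (Simplified: ignores epochs.)"""
    q = n // 2 + 1
    assert len(responding_zxids) >= q, "need a quorum to recover"
    return max(responding_zxids)

# 5-server ensemble: txn 7 counts as committed once 3 servers logged it,
# so any 3 responding servers are guaranteed to include a copy of it.
assert recovered_zxid([7, 7, 7, 6, 5], 5) == 7
assert recovered_zxid([7, 6, 5], 5) == 7
```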

Reads can give stale data for short periods of time, but will always give a
coherently updated picture of what the master knew at some point in the
past.  If a client connection is not lost, then the client's view of the last
update id is monotonically increasing.  If a client loses a connection and
reconnects, it is conceivable that it will see an earlier version of the
universe, but this is exceedingly unlikely because the time required to
reconnect is typically longer than how far out of date any ZK cluster member
tends to be.  You can force this to happen: connect to a ZK cluster member,
partition the cluster so that your connected server is in the majority,
update, and then connect to a cluster member that is separated from the
master of the ZK cluster.  You can always do a sync to force the server you
are using to catch up to the master (at least for the moment).  The use of
sync will force a monotonic view of time regardless of any
connection/disconnection/reconnection scenario.
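The session monotonicity and the effect of sync can be modeled in a few lines (a toy model with hypothetical ToyServer/ToySession names; the real client sends its last-seen zxid on reconnect and a real sync round-trips through the master):

```python
class ToyServer:
    def __init__(self, zxid):
        self.zxid = zxid  # last transaction this replica has applied

class ToySession:
    """Toy model of client-side monotonicity: a session refuses to
    attach to a server that is behind what the session has seen."""
    def __init__(self):
        self.last_seen = 0
        self.server = None

    def connect(self, server):
        if server.zxid < self.last_seen:
            raise ConnectionError("server is behind this session")
        self.server = server

    def read(self):
        self.last_seen = max(self.last_seen, self.server.zxid)
        return self.server.zxid

    def sync(self, leader):
        # sync forces the connected server to catch up to the leader
        # (at least momentarily) before the next read.
        self.server.zxid = max(self.server.zxid, leader.zxid)

leader, lagging = ToyServer(10), ToyServer(8)
s = ToySession()
s.connect(lagging)
assert s.read() == 8          # stale, but a coherent past state
s.sync(leader)
assert s.read() == 10         # after sync, caught up to the leader
try:
    s.connect(ToyServer(9))   # older than what this session has seen
except ConnectionError:
    pass
```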

Since all updates must go through the cluster master, updates cannot happen
in split-brain scenarios, and the strong serialization guarantees for all
updates are maintained.

The only data loss scenarios I have heard of in about 3 years of watching
all ZK mailing list traffic and all bug reports have had to do with the
theoretical possibility of problems due to disk controllers that lie about
when data is persisted.  I have seen cluster failures where memory is
exhausted, but I can't remember any that caused loss of data.  There was one
bug where somehow a cluster member remembered a transaction id that was one
past the last transaction that had been committed to the logs.  This was a
very strange coincidence of very aggressive operator error and a bug which
has been fixed.  The cluster refused to restart in this case, but the fix
was simply to delete the log on the confused machine and restart the
cluster.  No data was lost.

I have, no doubt, omitted something.
