zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Henry Robinson <he...@cloudera.com>
Subject Re: persistent storage and node recovery
Date Tue, 16 Mar 2010 00:56:23 GMT
Hi Maxime -

I'm not very familiar with Scalaris, but I can answer for the ZooKeeper side
of things.

ZooKeeper servers log each operation to a persistent store before they vote
on the outcome of that operation. So if a vote passes, we know that a
majority of servers has written that operation to disk. Then, if a node
fails and restarts, it can read all the committed operations from disk. As
long as a majority of nodes is still working, at least one of them will have
seen all the committed operations.

If we didn't do this, the loss of a majority of servers (even if they
restarted) could mean that updates are lost. But ZooKeeper is meant to be
durable - once a write is made, it will persist for the lifetime of the
system if it is not overwritten later. So in order to properly tolerate
crash failures and not lose any updates, you have to make sure a majority of
servers write to disk.

There is no possibility of more replicas being in the system than are
allowed - you start off with a fixed number, and never go above it.

Hope this helps - let me know if you have any further questions!


Henry Robinson
Software Engineer

On 15 March 2010 16:47, Maxime Caron <maxime.caron@gmail.com> wrote:

> Hi everybody,
> From what i understand Zookeeper consistency model work the same way as
> does
> Scalaris.
> Which is to keep the majority of the replica for an item UP.
> In Scalaris i
> f a single failed node does crash and recover, it simply start like a fresh
> new node and all data is lost.
> This is the case because it may otherwise get some inconsistencies as
> another node already took over.
> For a short timeframe there might be more replicas in the system than
> allowed, which destroys the proper functioning of our majority based
> algorithms.
> So my question is how Zookeeper use the persistent storage during node
> recovery, how does the
> majority based algorithms is different so consistency is preserved.
> Thanks a lots
> Maxime Caron

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message