Date: Mon, 15 Mar 2010 21:19:19 -0400
Subject: Re: persistent storage and node recovery
From: Maxime Caron
To: zookeeper-user@hadoop.apache.org

Thanks a lot, it's much clearer now.

When I said "more replicas" I didn't mean the number of nodes, but the
number of copies of an item's value. That was my misunderstanding,
because in Scalaris the item's value is re-replicated as nodes join and
leave the DHT.

So this is all about the "operation log": if a node is in the minority
but has a more recently committed value, that node has a veto over the
other nodes. This is where ZooKeeper differs from Scalaris, because
Scalaris doesn't have an operation log.

So if I understood correctly, ZooKeeper has a stronger consistency model
at the price of not being built on a DHT. I wonder if the two could be
combined to get the advantages of both.

On 15 March 2010 20:56, Henry Robinson wrote:

> Hi Maxime -
>
> I'm not very familiar with Scalaris, but I can answer for the ZooKeeper
> side of things.
>
> ZooKeeper servers log each operation to a persistent store before they
> vote on the outcome of that operation. So if a vote passes, we know
> that a majority of servers has written that operation to disk.
> Then, if a node fails and restarts, it can read all the committed
> operations back from disk. As long as a majority of nodes is still
> working, at least one of them will have seen all the committed
> operations.
>
> If we didn't do this, the loss of a majority of servers (even if they
> later restarted) could mean that updates are lost. But ZooKeeper is
> meant to be durable - once a write is made, it will persist for the
> lifetime of the system unless it is overwritten later. So in order to
> properly tolerate crash failures without losing any updates, you have
> to make sure a majority of servers write to disk.
>
> There is no possibility of more replicas being in the system than are
> allowed - you start off with a fixed number and never go above it.
>
> Hope this helps - let me know if you have any further questions!
>
> Henry
>
> --
> Henry Robinson
> Software Engineer
> Cloudera
> 415-994-6679
>
> On 15 March 2010 16:47, Maxime Caron wrote:
>
> > Hi everybody,
> >
> > From what I understand, ZooKeeper's consistency model works the same
> > way Scalaris's does: keep the majority of the replicas for an item
> > up.
> >
> > In Scalaris, if a single failed node crashes and recovers, it simply
> > starts like a fresh new node, and all its data is lost. This is done
> > because it might otherwise introduce inconsistencies, since another
> > node has already taken over. For a short timeframe there might be
> > more replicas in the system than allowed, which breaks the proper
> > functioning of the majority-based algorithms.
> >
> > So my question is: how does ZooKeeper use persistent storage during
> > node recovery, and how is its majority-based algorithm different so
> > that consistency is preserved?
> >
> > Thanks a lot,
> >
> > Maxime Caron
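Henry's explanation - log the operation to disk before voting, commit only
once a majority has logged it, and replay the log on restart - can be
sketched as a toy model. This is my own simplified sketch, not ZooKeeper's
actual Zab protocol; the `Server` class and all method names are invented
for illustration, and a real server would fsync the log rather than append
to a Python list.

```python
# Toy model of "log before vote" with majority-based commit.
# Not ZooKeeper's real implementation; names are hypothetical.

class Server:
    def __init__(self, name: str):
        self.name = name
        self.log = []        # stands in for the on-disk transaction log
        self.crashed = False

    def append(self, op: str) -> bool:
        """Persist the operation, then acknowledge the vote."""
        if self.crashed:
            return False
        self.log.append(op)  # a real server would fsync here
        return True

    def crash(self):
        self.crashed = True

    def restart(self):
        """On recovery, the durable log is still there to replay."""
        self.crashed = False


def commit(servers: list, op: str) -> bool:
    """An operation commits only once a majority has logged it."""
    acks = sum(server.append(op) for server in servers)
    return acks > len(servers) // 2


servers = [Server(f"s{i}") for i in range(5)]
assert commit(servers, "set /x 1")      # all 5 log it: committed

servers[0].crash()
servers[1].crash()
assert commit(servers, "set /x 2")      # 3 of 5 is still a majority

servers[3].crash()
assert not commit(servers, "set /x 3")  # only 2 of 5: no quorum

# Why restarts don't lose committed writes: any majority of servers
# intersects the majority that acked the write, so at least one
# surviving log contains it.
surviving_majority = servers[2:]        # any 3 of the 5 servers
assert any("set /x 2" in s.log for s in surviving_majority)
```

The last assertion is the heart of the argument: two majorities of the
same fixed ensemble always share at least one server, so a committed
operation can always be recovered from some log - which is exactly why
the ensemble size must stay fixed and "extra replicas" are never allowed.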