Date: Mon, 5 Oct 2015 18:58:59 +0100
From: Flavio P JUNQUEIRA
To: user@zookeeper.apache.org
Subject: RE: 3-server Zab cluster

Indeed, I meant to say quorum.
-Flavio

On 5 Oct 2015 6:30 pm, "Ibrahim El-sanosi (PGR)" <i.s.el-sanosi@newcastle.ac.uk> wrote:

> Hi Flavio,
>
> > That's not accurate. Being recorded by a quorum guarantees that a txn
> > will be in the initial state of future epochs, but a prospective leader
> > might have txns in its log that haven't been recorded in *a log*. The
> > prospective leader needs to make sure that such txns are recorded in a
> > quorum before establishing a new epoch, though.
>
> I guess you meant "a quorum", not "a log", at the highlighted *log* above.
>
> Thank you
>
> Ibrahim
>
> -----Original Message-----
> From: Flavio Junqueira [mailto:fpj@apache.org]
> Sent: Monday, October 05, 2015 06:23 PM
> To: user@zookeeper.apache.org
> Subject: Re: 3-server Zab cluster
>
> > On 05 Oct 2015, at 18:13, Ibrahim El-sanosi (PGR) <i.s.el-sanosi@newcastle.ac.uk> wrote:
> >
> > Hi Rakesh,
> >
> > In Zab, before the end of the synchronization phase, the new leader will
> > not commit any proposals in transaction logs that have not got a
> > majority of acks from the previous ensemble (that is what you are
> > saying).
>
> That's not accurate. Being recorded by a quorum guarantees that a txn will
> be in the initial state of future epochs, but a prospective leader might
> have txns in its log that haven't been recorded in a log. The prospective
> leader needs to make sure that such txns are recorded in a quorum before
> establishing a new epoch, though.
>
> > I think what Zab does is that before the end of the synchronization
> > phase, in L and F2 (the new quorum), L (a prospective leader) will sync
> > its own state with F2 as the initial state. Referring to my scenario,
> > zxid =10 is part of the initial state and as a result it will be
> > delivered in the new quorum (L and F2) before processing new proposals
> > of the new epoch.
>
> Yes, this is right.
>
> > You can read this thread
> > http://zookeeper-user.578899.n2.nabble.com/Zab-Failure-scenario-td7581583.html
> > for more info
> >
> > What do you think?
> > Does anyone have any questions or concerns about such (small)
> > optimization?
>
> I'm not entirely sure what the optimization is and if you are proposing a
> change or what. Are you looking for a blessing from this community? I'd
> like to understand what you're trying to achieve.
>
> -Flavio
>
> > Ibrahim
> >
> > From: Rakesh Radhakrishnan [mailto:rakeshr.apache@gmail.com]
> > Sent: Thursday, October 01, 2015 06:15 PM
> > To: Ibrahim El-sanosi (PGR)
> > Subject: Re: 3-server Zab cluster
> >
> > >>>>>>>> (***) Ok, I thought when F2 forms a quorum with L and before
> > >>>>>>>> serving clients, L synchronizes its state with F2, resulting in
> > >>>>>>>> zxid=10 being committed in L and F2 as well. I also thought
> > >>>>>>>> this process is the same as Zab, isn't it?
> >
> > Since L didn't receive any ACK responses from F1 or F2 before leaving
> > the Leader status previously, L won't commit transaction zxid=10. IIUC
> > after re-forming the new quorum L will not have any mechanism to
> > re-initiate the proposal (Active messaging phase) for the previous
> > zxid=10.
> >
> > -Rakesh
> >
> > On Thu, Oct 1, 2015 at 10:19 PM, Ibrahim El-sanosi (PGR) <i.s.el-sanosi@newcastle.ac.uk> wrote:
> > Thank you Rakesh.
> >
> > >>> In your case, zk client sees a successful response from F1. Then
> > >>> assume F2 joins quorum first and L becomes the leader again. But the
> > >>> newly formed quorum will not have the zxid=10 transaction. This will
> > >>> make the cluster inconsistent, isn't it?
> >
> > (***) Ok, I thought when F2 forms a quorum with L and before serving
> > clients, L synchronizes its state with F2, resulting in zxid=10 being
> > committed in L and F2 as well. I also thought this process is the same
> > as Zab, isn't it?
> >
> > >>> Apart from the above case I'm not seeing any other problems with 3
> > >>> node cluster.
> > >>> The above data loss case can be avoided by putting an assumption
> > >>> that more than a tolerated number of server failures may affect the
> > >>> cluster consistency and results in data loss.
> >
> > Yes, if the solution above (***) is not correct, your assumption makes
> > sense.
> >
> > Ibrahim
> >
> > From: Rakesh Radhakrishnan [mailto:rakeshr.apache@gmail.com]
> > Sent: 01 October 2015 17:26
> > To: user@zookeeper.apache.org; Ibrahim El-sanosi (PGR)
> > Subject: Re: 3-server Zab cluster
> >
> > Hi Ibrahim,
> >
> > Below example taken from your older mail thread.
> >
> > >>>>> 1. leader (L) sends a proposal p with zxid =10 to F1 and F2.
> > >>>>> 2. F1 logs, sends an ACK, commits, replies to clients and
> > >>>>> crashes. F2 crashes before receiving P10. L has not received any
> > >>>>> ACKs
> >
> > My thoughts on the above scenario are:
> >
> > In your case, zk client sees a successful response from F1. Then assume
> > F2 joins quorum first and L becomes the leader again. But the newly
> > formed quorum will not have the zxid=10 transaction. This will make the
> > cluster inconsistent, isn't it?
> >
> > Apart from the above case I'm not seeing any other problems with 3 node
> > cluster. The above data loss case can be avoided by putting an assumption
> > that more than a tolerated number of server failures may affect the
> > cluster consistency and results in data loss. But I feel this
> > optimization would have more cases if we scale up the cluster size beyond
> > 3 servers. Now, I'm not thinking in that direction as your case is
> > limited to 3 node cluster.
> >
> > Regards,
> > Rakesh
> >
> >
> > On Tue, Sep 29, 2015 at 2:28 PM, Ibrahim El-sanosi (PGR) <i.s.el-sanosi@newcastle.ac.uk> wrote:
> > Yes Alex, in my post I mentioned that this (small) optimization can only
> > work with 3-servers cluster.
> >
> > Who could confirm the optimization can work?
> >
> > Ibrahim
> >
> > -----Original Message-----
> > From: Alexander Shraer [mailto:shralex@gmail.com]
> > Sent: Tuesday, September 29, 2015 12:11 AM
> > To: user@zookeeper.apache.org
> > Subject: Re: 3-server Zab cluster
> >
> > I'm not 100% sure whether operations that were pending on the leader are
> > sent out during sync when this leader loses quorum and is re-elected. If
> > so, then maybe you're right. But in any case, this would not work for 5
> > or more servers...
> >
> > On Mon, Sep 28, 2015 at 3:51 PM, Ibrahim El-sanosi (PGR) <i.s.el-sanosi@newcastle.ac.uk> wrote:
> >
> >> Thank you Alex for replying.
> >>
> >> When you said "the leader gets re-elected and the operation is
> >> truncated from logs at other servers", I thought the new leader will
> >> sync its logs with other followers (synchronization phase),
> >> resulting in the operation being committed by the new quorum. Let me
> >> lay out the scenarios as steps:
> >>
> >> 1. leader (L) sends a proposal p with zxid =10 to F1 and F2.
> >> 2. F1 logs, sends an ACK, commits, replies to clients and crashes. F2
> >> crashes before receiving P10. L has not received any ACKs
> >>
> >> Possible solution (1)
> >> The leader will move to LOOKING phase as there is no quorum
> >> supporting its leadership. Now assume F2 wakes up. F2 forms a quorum
> >> with L (previous leader), L becomes new leader again as it has the
> >> latest zxid (10) in its log.
> >> L syncs its state with F2, as a result L, F1 (before crashing) and F2
> >> commit P10. Is that correct?
> >>
> >> Possible solution (2)
> >> The leader will move to LOOKING phase as there is no quorum
> >> supporting its leadership. Now assume F1 (with Zxid =10 committed)
> >> wakes up. I am not sure who should be a leader (F1 with Zxid =10
> >> committed or L (previous leader) with Zxid = 10 logged), I think F1
> >> becomes the new leader as it has Zxid = 10 committed.
> >> F1 forms a quorum with L (previous leader), F1 becomes new leader as
> >> it has the latest zxid (10). F1 (new leader) syncs its state with L
> >> (previous leader, now a follower), as a result Zxid 10 is committed
> >> by the new quorum. Is that correct?
> >>
> >> What do you think?
> >>
> >> Ibrahim
> >>
> >> -----Original Message-----
> >> From: Alexander Shraer [mailto:shralex@gmail.com]
> >> Sent: Monday, September 28, 2015 07:27 PM
> >> To: user@zookeeper.apache.org
> >> Cc: dev@zookeeper.apache.org
> >> Subject: Re: 3-server Zab cluster
> >>
> >> Committing locally when sending an ACK at a server would lead to loss
> >> of consistency - it is possible that this is the only server that
> >> acks, e.g., this server is temporarily disconnected from the leader,
> >> the leader gets re-elected and the operation is truncated from logs
> >> at other servers. It's ok to ACK it but it's not ok to commit since
> >> this exposes this to users as a committed operation that they can see.
> >>
> >> On Mon, Sep 28, 2015 at 4:19 AM, Ibrahim El-sanosi (PGR) <i.s.el-sanosi@newcastle.ac.uk> wrote:
> >>
> >>> In Zab, assume we have a cluster that consists of 3 servers. To
> >>> deliver a write request, it must run 3 communication steps: proposal,
> >>> acknowledgement and commit.
> >>> As Zab uses reliable FIFO, it is possible to remove the commit round.
> >>> As soon as a follower receives a proposal, it logs, sends an ACK and
> >>> commits locally. Upon receiving an ACK from any follower, the leader
> >>> commits a proposal locally, no COMMIT message needs to be sent to
> >>> followers. In this case, all servers commit a proposal in two
> >>> round-trips, resulting in reducing latency particularly in followers.
> >>>
> >>> Note that this optimization can only work in a 3-server cluster
> >>> (follower reaches a majority as soon as it acks).
> >>> Does anyone see any problems with such (small) optimization?
> >>> Ibrahim
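
Two points from the thread can be made concrete with a toy model: the quorum arithmetic behind "follower reaches a majority as soon as it acks" and "this would not work for 5 or more servers", and the synchronization step in which the prospective leader L makes sure its logged txns are recorded by the new quorum before the new epoch starts. This is only a sketch under stated assumptions, not ZooKeeper code: the function names are hypothetical, the prior entry zxid 9 is invented for illustration, and synchronization is reduced to followers adopting the prospective leader's log.

```python
def quorum_size(n_servers: int) -> int:
    """Smallest strict majority of an ensemble of n_servers."""
    return n_servers // 2 + 1


def single_ack_is_quorum(n_servers: int) -> bool:
    """True if the leader plus one acking follower already form a majority.

    This is the premise of the proposed optimization: a follower that ACKs
    knows the leader also logged the proposal, so two servers hold it.
    """
    return 2 >= quorum_size(n_servers)


def sync_new_epoch(logs: dict, leader: str, quorum: set) -> list:
    """Toy synchronization phase (assumption: every member of the new quorum
    simply adopts the prospective leader's log).

    Returns the initial state of the new epoch.
    """
    if leader not in quorum or len(quorum) < quorum_size(len(logs)):
        raise ValueError("prospective leader is not supported by a quorum")
    for server in quorum:
        logs[server] = list(logs[leader])
    return list(logs[leader])


# Scenario from the thread: L logged zxid 10, F1 logged it and crashed,
# F2 crashed before receiving it. F2 comes back and forms a quorum with L.
# The earlier entry 9 is a made-up placeholder for prior history.
logs = {"L": [9, 10], "F1": [9, 10], "F2": [9]}
initial_state = sync_new_epoch(logs, leader="L", quorum={"L", "F2"})

print(single_ack_is_quorum(3))  # True
print(single_ack_is_quorum(5))  # False
print(initial_state)            # [9, 10]
```

In this model zxid 10 ends up in the initial state of the new epoch because L had it in its log, which matches the part of the scenario Flavio confirms. The model does not address Alexander's objection about exposing a locally committed operation before a quorum is known to hold it.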