Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@zookeeper.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <AM2PR07MB0481D00973795C42B42C76A2A1480@AM2PR07MB0481.eurprd07.prod.outlook.com>
References: 
 <AM2PR07MB048102D9843788BE432211D6A14F0@AM2PR07MB0481.eurprd07.prod.outlook.com>
	<CANcXBFPwjgbdcUtwTNUhix0CD-fE-sMfTcAG1Tc6Jdr6BMQ0AA@mail.gmail.com>
	<AM2PR07MB04814721CD07A869A5EEB7FFA14F0@AM2PR07MB0481.eurprd07.prod.outlook.com>
	<CANcXBFP+KRcxw_YYU2RS7aoQ4CB1o8B4FsDvKR53=CLai4B3dw@mail.gmail.com>
	<AM2PR07MB04811EF9D5823729778ADFA7A14E0@AM2PR07MB0481.eurprd07.prod.outlook.com>
	<CAHB_t6qxKP-kjrk2-3CjV_BgBO2ELb3RNWDUiHYZxtwhSE008Q@mail.gmail.com>
	<AM2PR07MB048112BF76499B6B75FF5114A14C0@AM2PR07MB0481.eurprd07.prod.outlook.com>
	<CAHB_t6qSaiWjBSk45O9r0y6HAmcX1KsVXJeR0xZx9E8m9_6a0Q@mail.gmail.com>
	<AM2PR07MB048102758D292C7E21108B9AA1480@AM2PR07MB0481.eurprd07.prod.outlook.com>
	<079862CD-6BCD-4186-B5AD-60FDD70F2881@apache.org>
	<AM2PR07MB0481D00973795C42B42C76A2A1480@AM2PR07MB0481.eurprd07.prod.outlook.com>
Date: Mon, 5 Oct 2015 18:58:59 +0100
Message-ID: 
 <CAB5oV289DVNpCCxvJHpqvh-yDyPz7SP=uQw-7CBKJbM0FcMN3g@mail.gmail.com>
Subject: RE: 3-server Zab cluster
From: Flavio P JUNQUEIRA <fpj@apache.org>
To: user@zookeeper.apache.org
Content-Type: multipart/alternative; boundary=001a11359b029352ff05215f4473

--001a11359b029352ff05215f4473
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Indeed, I meant to say quorum.

-Flavio
On 5 Oct 2015 6:30 pm, "Ibrahim El-sanosi (PGR)" <
i.s.el-sanosi@newcastle.ac.uk> wrote:

> Hi Flavio,
>
>
> >That's not accurate. Being recorded by a quorum guarantees that a txn
> will be in the initial state of future epochs, but a prospective leader
> might have txns in its log that haven't been recorded in *a log*. The
> prospective leader needs to make sure that such txns are recorded in a
> quorum before establishing a new epoch, though.
>
> I guess you meant a quorum, not a log, in the word *log* above!
>
> Thank you
>
> Ibrahim
>
> -----Original Message-----
> From: Flavio Junqueira [mailto:fpj@apache.org]
> Sent: Monday, October 05, 2015 06:23 PM
> To: user@zookeeper.apache.org
> Subject: Re: 3-server Zab cluster
>
>
> > On 05 Oct 2015, at 18:13, Ibrahim El-sanosi (PGR) <
> i.s.el-sanosi@newcastle.ac.uk> wrote:
> >
> > Hi Rakesh,
> >
> > In Zab, before the end of the synchronization phase, the new leader will not
> commit any proposals in transaction logs that have not got a majority of
> acks from the previous ensemble (that is what you are saying).
>
> That's not accurate. Being recorded by a quorum guarantees that a txn will
> be in the initial state of future epochs, but a prospective leader might
> have txns in its log that haven't been recorded in a log. The prospective
> leader needs to make sure that such txns are recorded in a quorum before
> establishing a new epoch, though.
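
To make the point about recording on a quorum concrete, here is a minimal
Python sketch. It is only an illustration under simplifying assumptions, not
ZooKeeper code: the Server class and the quorum_size, holders and synchronize
helpers are hypothetical names, and a log is modelled as a plain list of zxids.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical stand-in for a Zab peer; log holds recorded zxids."""
    name: str
    log: list = field(default_factory=list)


def quorum_size(ensemble_size):
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def holders(zxid, servers):
    """Number of the given servers that have zxid in their log."""
    return sum(1 for s in servers if zxid in s.log)


def synchronize(leader, live_followers, ensemble_size):
    """Copy the prospective leader's log to the live followers.

    Returns True only if the participants form a quorum and every txn in
    the leader's log ends up recorded on a quorum of the ensemble.
    """
    participants = [leader] + live_followers
    needed = quorum_size(ensemble_size)
    if len(participants) < needed:
        return False  # no quorum supports this leader, stay in LOOKING
    for follower in live_followers:
        missing = [z for z in leader.log if z not in follower.log]
        follower.log.extend(missing)
    return all(holders(z, participants) >= needed for z in leader.log)


# Scenario from this thread: L logged zxid 10, F1 is down, F2 has nothing.
leader = Server("L", [10])
f2 = Server("F2")
before = holders(10, [leader, f2])   # only L holds it among live servers
ok = synchronize(leader, [f2], ensemble_size=3)
after = holders(10, [leader, f2])    # now L and F2 both hold it
print(before, ok, after)
```

In this sketch the new epoch may only be established when synchronize returns
True, that is, after zxid 10 is on both L and F2.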
>
> > I think what Zab does is that before the end of synchronization phase,
> in L and F2 (the new quorum), L (a prospective leader) will sync its own
> state with F2 as the initial state. Referring to my scenario, zxid=10 is
> part of the initial state and as a result it will be delivered in new
> quorum (L and F2) before  processing new proposals of new epoch.
>
> Yes, this is right.
>
> >
> > You can read this thread for more info:
> > http://zookeeper-user.578899.n2.nabble.com/Zab-Failure-scenario-td7581583.html
> >
> > What do you think? Does anyone have any questions or concerns about such
> a (small) optimization?
>
> I'm not entirely sure what the optimization is and if you are proposing a
> change or what. Are you looking for a blessing from this community? I'd
> like to understand what you're trying to achieve.
>
> -Flavio
>
> >
> > Ibrahim
> >
> > From: Rakesh Radhakrishnan [mailto:rakeshr.apache@gmail.com]
> > Sent: Thursday, October 01, 2015 06:15 PM
> > To: Ibrahim El-sanosi (PGR)
> > Subject: Re: 3-server Zab cluster
> >
> >>>>>>>>> (***) Ok, I thought when F2 forms a quorum with L and before
> serving clients, L synchronizes its state with F2, resulting in zxid=10
> being committed in L and F2 as well. I also thought this process is the
> same as Zab, isn't it?
> >
> > Since L didn't receive any ACK responses from F1 or F2 before leaving
> the Leader status previously, L won't commit transaction zxid=10. IIUC,
> after re-forming the new quorum L will not have any mechanism to
> re-initiate the proposal (active messaging phase) for the previous zxid=10.
> >
> > -Rakesh
> >
> > On Thu, Oct 1, 2015 at 10:19 PM, Ibrahim El-sanosi (PGR) <
> i.s.el-sanosi@newcastle.ac.uk> wrote:
> > Thank you Rakesh.
> >
> >>>> In your case, zk client sees a successful response from F1. Then assume
> >>>> F2 joins quorum first and L becomes the leader again. But the newly formed
> >>>> quorum will not have the zxid=10 transaction. This will make the cluster
> >>>> inconsistent, isn't it?
> >
> > (***) Ok, I thought when F2 forms a quorum with L and before serving
> clients, L synchronizes its state with F2, resulting in zxid=10 being
> committed in L and F2 as well. I also thought this process is the same as
> Zab, isn't it?
> >
> >
> >>>> Apart from the above case I'm not seeing any other problems with a 3-node
> >>>> cluster. The above data loss case can be avoided by putting an assumption
> >>>> that more than a tolerated number of server failures may affect the cluster
> >>>> consistency and result in data loss.
> >
> > Yes, if the solution above (***) is not correct, your assumption makes
> sense.
> >
> > Ibrahim
> >
> > From: Rakesh Radhakrishnan [mailto:rakeshr.apache@gmail.com]
> > Sent: 01 October 2015 17:26
> > To: user@zookeeper.apache.org; Ibrahim El-sanosi (PGR)
> >
> > Subject: Re: 3-server Zab cluster
> >
> > Hi Ibrahim,
> >
> > The example below is taken from your older mail thread.
> >
> >>>>>> 1. leader (L) sends a proposal p with zxid=10 to F1 and F2.
> >>>>>> 2. F1 logs, sends an ACK, commits, replies to clients and
> >>>>>> crashes. F2 crashes before receiving P10. L has not received any
> >>>>>> ACKs
> >
> > My thoughts on the above scenario are:
> >
> > In your case, zk client sees a successful response from F1. Then assume
> F2 joins quorum first and L becomes the leader again. But the newly formed
> quorum will not have the zxid=10 transaction. This will make the cluster
> inconsistent, isn't it?
> >
> > Apart from the above case I'm not seeing any other problems with a 3-node
> cluster. The above data loss case can be avoided by putting an assumption
> that more than a tolerated number of server failures may affect the cluster
> consistency and result in data loss. But I feel this optimization would
> have more cases if we scale up the cluster size beyond 3 servers. Now, I'm
> not thinking in that direction as your case is limited to a 3-node cluster.
> >
> > Regards,
> > Rakesh
> >
> >
> > On Tue, Sep 29, 2015 at 2:28 PM, Ibrahim El-sanosi (PGR) <
> i.s.el-sanosi@newcastle.ac.uk> wrote:
> > Yes Alex, in my post I mentioned that this (small) optimization can only
> work with a 3-server cluster.
> >
> > Who could confirm the optimization can work?
> >
> > Ibrahim
> >
> > -----Original Message-----
> > From: Alexander Shraer [mailto:shralex@gmail.com]
> > Sent: Tuesday, September 29, 2015 12:11 AM
> > To: user@zookeeper.apache.org
> > Subject: Re: 3-server Zab cluster
> >
> > I'm not 100% sure whether operations that were pending on the leader are
> sent out during sync when this leader loses quorum and is re-elected. If so,
> then maybe you're right. But in any case, this would not work for 5 or more
> servers...
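
The arithmetic behind the "3 servers only" remark can be checked with a short
Python sketch. The function names are hypothetical, and it assumes the leader
has already logged the proposal itself, so one follower ACK gives two copies.

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def one_ack_is_quorum(ensemble_size):
    """True if the leader's own copy plus a single follower ACK is a majority."""
    copies = 1 + 1  # leader's log entry + one acknowledging follower
    return copies >= quorum_size(ensemble_size)


for n in (3, 5, 7):
    print(n, quorum_size(n), one_ack_is_quorum(n))
# prints:
# 3 2 True
# 5 3 False
# 7 4 False
```

With 5 servers a follower that commits on its own ACK only knows about 2
copies, which is below the majority of 3.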
> >
> > On Mon, Sep 28, 2015 at 3:51 PM, Ibrahim El-sanosi (PGR) <
> i.s.el-sanosi@newcastle.ac.uk> wrote:
> >
> >> Thank you Alex for replying.
> >>
> >> When you said "the leader gets re-elected and the operation is
> >> truncated from logs at other servers", I thought the new leader would
> >> sync its logs with the other followers (synchronization phase),
> >> resulting in the operation being committed by the new quorum. Let me lay
> >> out the scenario as steps:
> >>
> >> 1. leader (L) sends a proposal p with zxid=10 to F1 and F2.
> >> 2. F1 logs, sends an ACK, commits, replies to clients and crashes. F2
> >> crashes before receiving P10. L has not received any ACKs
> >>
> >> Possible solution  (1)
> >> The leader will move to LOOKING phase as there is no quorum
> >> supporting its leadership. Now Assume F2 wakes up. F2 forms a quorum
> >> with L (the previous leader), L becomes the new leader again as it has the
> >> latest zxid (10) in its log.
> >> L syncs its state with F2, as a result L, F1 (before crashing) and F2
> >> commit P10.  Is that correct?
> >>
> >> Possible solution  (2)
> >> The leader will move to LOOKING phase as there is no quorum
> >> supporting its leadership. Now assume F1 (with zxid=10 committed)
> >> wakes up. I am not sure who should be the leader (F1 with zxid=10
> >> committed or L, the previous leader, with zxid=10 logged). I think F1
> >> becomes the new leader as it has zxid=10 committed. F1 forms a quorum
> >> with L (the previous leader). F1 (the new leader) syncs its state with
> >> L (the previous leader, now a follower), and as a result zxid=10 is
> >> committed by the new quorum. Is that correct?
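
On the question in solution (2) of who becomes leader, here is a rough Python
sketch. It assumes votes are ordered by (epoch, last logged zxid, server id);
that ordering is an assumption about the default election and is not stated
anywhere in this thread. Vote, pick_leader and the server ids 1 and 2 are
hypothetical.

```python
from typing import NamedTuple


class Vote(NamedTuple):
    """Hypothetical vote; field order gives the assumed comparison order."""
    epoch: int
    zxid: int        # last zxid in the server's log, committed or not
    server_id: int


def pick_leader(votes):
    """Return the highest vote under plain tuple ordering."""
    return max(votes)


# L (id 1) has zxid 10 logged; F1 (id 2) has zxid 10 logged and committed.
l_vote = Vote(epoch=1, zxid=10, server_id=1)
f1_vote = Vote(epoch=1, zxid=10, server_id=2)
print(pick_leader([l_vote, f1_vote]).server_id)  # prints 2
```

In this sketch, whether zxid 10 was committed or only logged plays no part in
the comparison. With equal epoch and zxid the tie is broken by server id alone.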
> >>
> >> What do you think?
> >>
> >> Ibrahim
> >>
> >>
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Alexander Shraer [mailto:shralex@gmail.com]
> >> Sent: Monday, September 28, 2015 07:27 PM
> >> To: user@zookeeper.apache.org
> >> Cc: dev@zookeeper.apache.org
> >> Subject: Re: 3-server Zab cluster
> >>
> >> Committing locally when sending an ACK at a server would lead to loss
> >> of consistency - it is possible that this is the only server that
> >> acks, e.g., this server is temporarily disconnected from the leader,
> >> the leader gets re-elected and the operation is truncated from logs
> >> at other servers. It's OK to ACK it, but it's not OK to commit, since
> >> this exposes it to users as a committed operation that they can see.
> >>
> >> On Mon, Sep 28, 2015 at 4:19 AM, Ibrahim El-sanosi (PGR) <
> >> i.s.el-sanosi@newcastle.ac.uk> wrote:
> >>
> >>> In Zab, assume we have a cluster consisting of 3 servers. To deliver a
> >>> write request, it must run 3 communication steps: proposal,
> >>> acknowledgement and commit.
> >>> As Zab uses reliable FIFO channels, it is possible to remove the commit
> >>> round. As soon as a follower receives a proposal, it logs it, sends an
> >>> ACK and commits locally. Upon receiving an ACK from any follower, the
> >>> leader commits the proposal locally; no COMMIT message needs to be sent
> >>> to followers. In this case, all servers commit a proposal in two
> >>> round-trips, reducing latency, particularly at followers.
> >>>
> >>> Note that this optimization can only work in a 3-server cluster
> >>> (a follower reaches a majority as soon as it acks).
> >>> Does anyone see any problems with such a (small) optimization?
> >>> Ibrahim
>
>

--001a11359b029352ff05215f4473--