Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@zookeeper.apache.org
Received-SPF: pass (nike.apache.org: local policy)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 6.3 \(1503\))
Subject: Re: Recovery time (was: Maximum size of a snapshot)
From: Flavio Junqueira <fpjunqueira@yahoo.com>
In-Reply-To: <9B8C8395-14D1-4FEA-86BF-56D0F76FAB79@yahoo.com>
Date: Wed, 17 Jul 2013 15:43:54 +0200
Cc: "dev@zookeeper.apache.org" <dev@zookeeper.apache.org>
Content-Transfer-Encoding: quoted-printable
Message-Id: <8699B550-87FD-4094-81B3-B4D0A93C6E3B@yahoo.com>
References: 
 <AA0B8A9CF5974846889A3DF776C1DF17569D0671@PRN-MBX02-2.TheFacebook.com>
 <9B8C8395-14D1-4FEA-86BF-56D0F76FAB79@yahoo.com>
To: user@zookeeper.apache.org

I need to also mention ZOOKEEPER-1549 in the context of point (2) below.
That's a blocker for 3.5.0.

-Flavio

On Jul 17, 2013, at 12:30 PM, Flavio Junqueira <fpjunqueira@yahoo.com> wrote:

> Moving the discussion to dev but keeping user on CC.
>
> Let's step back. We started the latest discussion in this thread
> because Kishore is concerned about recovery time. There are a number
> of improvements we have been looking at for the next release. Let me
> go over my current understanding of the main points that add to the
> recovery time:
>
> 1- Before we even start leader election, each server loads state from
> disk to determine its last zxid. The last zxid is used in the election;
> 2- Once the leader is elected, it loads state from disk and takes a
> snapshot. Loading the database again is unnecessary (ZOOKEEPER-1642)
> and the snapshot adds latency. In fact, it is not even correct to have
> it there (ZOOKEEPER-1558);
> 3- A follower takes a snapshot before acknowledging the NEWLEADER
> message, so the leader has to wait until a quorum of followers finish
> their snapshots.
>
> The proposal I've heard here is to touch (1). For now, I'd rather keep
> (1) as is and focus on fixing (2). We might be able to do something
> about (3), and I'm actually not sure whether there has been a
> discussion about it.
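To make point (3) concrete, here is a toy model of the wait. It is not
ZooKeeper code. It assumes that all followers start their snapshots at
the same moment, that the leader counts itself toward the quorum, and
that network delay is negligible. The function name quorum_wait and the
sample durations are made up for illustration.

```python
def quorum_wait(follower_snapshot_secs):
    """Seconds until the leader holds NEWLEADER acks from a quorum (toy model).

    follower_snapshot_secs: one snapshot duration per follower, in seconds.
    """
    ensemble_size = len(follower_snapshot_secs) + 1  # followers plus the leader
    quorum = ensemble_size // 2 + 1                  # strict majority
    acks_needed = quorum - 1                         # assumption: leader counts itself
    if acks_needed == 0:
        return 0.0
    # The leader is unblocked once the acks_needed-th fastest follower finishes.
    return sorted(follower_snapshot_secs)[acks_needed - 1]


if __name__ == "__main__":
    # Hypothetical 5-server ensemble: 4 followers with different snapshot times.
    print(quorum_wait([30.0, 45.0, 120.0, 20.0]))
```

Under these assumptions the slowest followers do not matter: the wait is
set by the slowest member of the fastest quorum.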
>
> -Flavio
>
> On Jul 17, 2013, at 5:40 AM, Thawan Kooburat <thawan@fb.com> wrote:
>
>> A client gets a session expired event only when a server explicitly
>> tells it so. Any established session therefore remains in a
>> disconnected state during this period.
>>
>> So my comment about the need for a longer session timeout might be
>> incorrect. While the quorum is down during leader election, sessions
>> won't expire. When the quorum comes back, the client has to reconnect
>> within the session timeout in order to resume the session. However,
>> the client won't be able to issue any read/write request or create a
>> new session while the quorum is down.
>>
>> However, some applications may need a stronger consistency guarantee.
>> They will have special logic to abort the client if it was
>> disconnected for an extended period. This is because the client won't
>> be able to tell whether the quorum is down or there is a network
>> partition between the client and the quorum.
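The abort logic described above could be sketched roughly as follows.
This is a hypothetical helper, not part of any ZooKeeper client library:
DisconnectWatchdog, its method names, and the threshold are assumptions.
The application would call on_disconnected and on_connected from
whatever connection-state callback its client provides, and poll
should_abort before acting on possibly stale state.

```python
import time


class DisconnectWatchdog:
    """Tracks how long a client has been disconnected (illustrative only)."""

    def __init__(self, max_disconnected_secs, clock=time.monotonic):
        self.max_disconnected_secs = max_disconnected_secs
        self.clock = clock                # injectable so the logic can be tested
        self.disconnected_since = None    # None means currently connected

    def on_disconnected(self):
        # Keep the earliest timestamp if several disconnect events arrive in a row.
        if self.disconnected_since is None:
            self.disconnected_since = self.clock()

    def on_connected(self):
        self.disconnected_since = None

    def should_abort(self):
        # The client cannot tell a quorum outage from a partition, so it
        # decides purely on elapsed time since the first disconnect event.
        if self.disconnected_since is None:
            return False
        elapsed = self.clock() - self.disconnected_since
        return elapsed > self.max_disconnected_secs
```

The threshold is an application choice. It is independent of the session
timeout, since the session itself does not expire while the quorum is
down.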
>>
>>
>> --
>> Thawan Kooburat
>>
>>
>>
>>
>>
>> On 7/16/13 6:46 PM, "kishore g" <g.kishore@gmail.com> wrote:
>>
>>> Thanks Thawan. Another question to follow up: let's say client c1 is
>>> connected to the leader and the leader fails. Now c1 is trying to
>>> connect to another zk server, but all servers are busy loading the
>>> snapshot, which can take a minute or two. According to Flavio, zk
>>> servers don't accept any requests during synchronization, but most
>>> clients don't keep that high a connection timeout. So does this mean
>>> clients will time out on connection? Is my understanding correct, or
>>> will zk servers accept connection requests but reject read/write
>>> requests?
>>>
>>> thanks,
>>> Kishore G
>>>
>>>
>>> On Tue, Jul 16, 2013 at 3:45 PM, Thawan Kooburat <thawan@fb.com> wrote:
>>>
>>>> There is a plan to work on this optimization ZOOKEEPER-1674.
>>>>
>>>>
>>>> --
>>>> Thawan Kooburat
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 7/16/13 1:37 PM, "kishore g" <g.kishore@gmail.com> wrote:
>>>>
>>>>> All servers in the quorum reading the snapshot from disk as part of
>>>>> the synchronization phase. From Thawan's email it looks like
>>>>> whenever there is a leader election, all zk servers read the
>>>>> snapshot from disk. I am not sure why all servers should reload the
>>>>> snapshot from disk, as this increases unavailability time.
>>>>>
>>>>>
>>>>> On Tue, Jul 16, 2013 at 12:35 PM, Flavio Junqueira
>>>>> <fpjunqueira@yahoo.com> wrote:
>>>>>
>>>>>> The synchronization phase is part of the protocol and we use it to
>>>>>> guarantee that we expose a consistent view of the state. During the
>>>>>> synchronization phase, servers do not accept requests.
>>>>>>
>>>>>> Which behavior are you proposing we change, Kishore?
>>>>>>
>>>>>> -Flavio
>>>>>>
>>>>>> On Jul 16, 2013, at 7:04 PM, kishore g <g.kishore@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for the clarification, Flavio. Does this mean that during
>>>>>>> leader election, both reads and writes are not supported? Should we
>>>>>>> start a separate thread/jira about changing this behavior?
>>>>>>>
>>>>>>> thanks,
>>>>>>> Kishore G
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jul 16, 2013 at 9:16 AM, Flavio Junqueira
>>>>>>> <fpjunqueira@yahoo.com> wrote:
>>>>>>>
>>>>>>>> The disk state should be the authoritative state of a server, so if
>>>>>>>> I remember correctly, we load the database as a way of validating
>>>>>>>> the disk state. I don't claim that this is strictly necessary, but
>>>>>>>> if we are to change it, then I would need to think this through.
>>>>>>>>
>>>>>>>> About leader election, if a leader loses support from a quorum of
>>>>>>>> followers, then it will drop leadership. Any event that causes a
>>>>>>>> follower to stop receiving messages from the leader or the follower
>>>>>>>> to disconnect from the leader will make it stop supporting the
>>>>>>>> current leader.
>>>>>>>>
>>>>>>>> -Flavio
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Sergey Maslyakov [mailto:evolvah@gmail.com]
>>>>>>>> Sent: 16 July 2013 16:16
>>>>>>>> To: user@zookeeper.apache.org
>>>>>>>> Subject: Re: Maximum size of a snapshot
>>>>>>>>
>>>>>>>> And another extension on top of Kishore's question: do
>>>>>>>> re-elections happen if the previously elected leader remains in
>>>>>>>> the cluster? In other words, what events can trigger re-election
>>>>>>>> and the corresponding temporary degradation of the service
>>>>>>>> provided by ZooKeeper?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> /Sergey
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 16, 2013 at 2:21 AM, kishore g <g.kishore@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Regarding #2: is it really true that during leader election every
>>>>>>>>> machine reloads snapshot data from disk? Is there any reason why
>>>>>>>>> this is needed unless it really needs to truncate or undo
>>>>>>>>> conflicting transactions already applied?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jul 15, 2013 at 9:50 PM, Thawan Kooburat <thawan@fb.com> wrote:
>>>>>>>>>
>>>>>>>>>> Max snapshot size:
>>>>>>>>>>
>>>>>>>>>> Here is my take on these issues; others feel free to add or
>>>>>>>>>> correct.
>>>>>>>>>>
>>>>>>>>>> 1. Depends on how much RAM your machine has. The snapshot should
>>>>>>>>>> be smaller than the available RAM since everything is loaded into
>>>>>>>>>> memory.
>>>>>>>>>> 2. Depends on the availability guarantee that the client needs.
>>>>>>>>>> If there is a leader election, every machine needs to reload the
>>>>>>>>>> data from disk, so the quorum will be down for at least as long
>>>>>>>>>> as the snapshot loading time. The session timeout on the client
>>>>>>>>>> side should be longer than the expected downtime during leader
>>>>>>>>>> election.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thawan Kooburat
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 7/15/13 8:46 PM, "Sergey Maslyakov" <evolvah@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I have a couple of sizing questions for the users and developers.
>>>>>>>>>>> I hope you don't mind answering them.
>>>>>>>>>>>
>>>>>>>>>>> What is the guideline for the maximum reasonable size of a
>>>>>>>>>>> DataTree that a single ZK server can manage? If a ZK server
>>>>>>>>>>> writes out a snapshot of about 1GB in size, is it pushed beyond
>>>>>>>>>>> the limits or is it still manageable? If so, where is the
>>>>>>>>>>> critical threshold when ZK is really being abused?
>>>>>>>>>>>
>>>>>>>>>>> Similarly, how can I estimate the propagation delay of a change
>>>>>>>>>>> across an ensemble of three ZK servers?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thank you,
>>>>>>>>>>> /Sergey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>
>