Subject: Re: Recovery time (was: Maximum size of a snapshot)
From: Flavio Junqueira
Date: Wed, 17 Jul 2013 15:43:54 +0200
To: user@zookeeper.apache.org
Cc: "dev@zookeeper.apache.org"
In-Reply-To: <9B8C8395-14D1-4FEA-86BF-56D0F76FAB79@yahoo.com>
References: <9B8C8395-14D1-4FEA-86BF-56D0F76FAB79@yahoo.com>
Message-Id: <8699B550-87FD-4094-81B3-B4D0A93C6E3B@yahoo.com>

I need to also mention ZOOKEEPER-1549 in the context of point (2) below. That's a blocker for 3.5.0.

-Flavio

On Jul 17, 2013, at 12:30 PM, Flavio Junqueira wrote:

> Moving the discussion to dev but keeping user on CC.
>
> Let's step back. The reason why we started the latest discussion in this thread was because Kishore is concerned about recovery time.
> There are a number of improvements we have been looking at for the next release; let me go over my current understanding of the main points that add to the recovery time:
>
> 1- Before we even start leader election, each server loads state from disk to determine its last zxid. The last zxid is used in the election;
> 2- Once the leader is elected, it loads state from disk and takes a snapshot. Loading the database again is unnecessary (ZOOKEEPER-1642) and the snapshot adds latency. In fact, it is not even correct to have it there (ZOOKEEPER-1558).
> 3- A follower takes a snapshot before acknowledging the NEWLEADER message, so the leader has to wait until a quorum of followers finishes their snapshot.
>
> The proposal I've heard here is to touch (1). For now, I'd rather keep (1) as is and focus on fixing (2). We might be able to do something about (3) and I'm actually not sure if there has been a discussion about it or not.
>
> -Flavio
>
> On Jul 17, 2013, at 5:40 AM, Thawan Kooburat wrote:
>
>> A client will get a session expire event only when a server explicitly
>> tells the client. So any established sessions will remain in a
>> disconnected state during the period.
>>
>> So my comment about the need for a longer session timeout might be
>> incorrect. While the quorum is down during leader election, sessions
>> won't expire. When the quorum comes back, the client has to reconnect
>> within the session timeout in order to resume the session. However, the
>> client won't be able to issue any read/write request or create a new
>> session while the quorum is down.
>>
>> However, some applications may need a stronger consistency guarantee.
>> They will have special logic to abort the client if it was disconnected
>> for an extended period.
>> This is because the client won't be able to tell if the quorum is down
>> or there is a network partition between the client and the quorum.
>>
>> --
>> Thawan Kooburat
>>
>> On 7/16/13 6:46 PM, "kishore g" wrote:
>>
>>> Thanks Thawan. Another question to follow up: let's say client c1 is
>>> connected to the leader and the leader fails. Now c1 is trying to
>>> connect to another zk server, but all servers are busy loading the
>>> snapshot and can take a minute or two. According to Flavio, zk servers
>>> don't accept any request during synchronization, but most clients
>>> don't keep that high a connection timeout. So does this mean clients
>>> will time out on connection? Is my understanding correct, or will zk
>>> servers accept connection requests but reject read/write requests?
>>>
>>> thanks,
>>> Kishore G
>>>
>>> On Tue, Jul 16, 2013 at 3:45 PM, Thawan Kooburat wrote:
>>>
>>>> There is a plan to work on this optimization: ZOOKEEPER-1674.
>>>>
>>>> --
>>>> Thawan Kooburat
>>>>
>>>> On 7/16/13 1:37 PM, "kishore g" wrote:
>>>>
>>>>> All servers in the quorum reading the snapshot from disk as part of
>>>>> the synchronization phase. From Thawan's email it looks like whenever
>>>>> there is a leader election, all zk servers read the snapshot from
>>>>> disk. I am not sure why all servers should reload the snapshot from
>>>>> disk, as this increases unavailability time.
>>>>>
>>>>> On Tue, Jul 16, 2013 at 12:35 PM, Flavio Junqueira wrote:
>>>>>
>>>>>> The synchronization phase is part of the protocol and we use it to
>>>>>> guarantee that we expose a consistent view of the state. During the
>>>>>> synchronization phase, servers do not accept requests.
>>>>>>
>>>>>> Which behavior are you proposing we change, Kishore?
>>>>>>
>>>>>> -Flavio
>>>>>>
>>>>>> On Jul 16, 2013, at 7:04 PM, kishore g wrote:
>>>>>>
>>>>>>> Thanks for the clarification, Flavio. Does this mean that during
>>>>>>> leader election, both reads and writes are not supported? Do we
>>>>>>> start a separate thread/jira for changing this behavior?
>>>>>>>
>>>>>>> thanks,
>>>>>>> Kishore G
>>>>>>>
>>>>>>> On Tue, Jul 16, 2013 at 9:16 AM, Flavio Junqueira wrote:
>>>>>>>
>>>>>>>> The disk state should be the authoritative state of a server, so
>>>>>>>> if I remember correctly, we load the database as a way of
>>>>>>>> validating the disk state. I don't claim that this is strictly
>>>>>>>> necessary, but if we are to change it, then I would need to think
>>>>>>>> this through.
>>>>>>>>
>>>>>>>> About leader election, if a leader loses support from a quorum of
>>>>>>>> followers, then it will drop leadership. Any event that causes a
>>>>>>>> follower to stop receiving messages from the leader or the
>>>>>>>> follower to disconnect from the leader will make it stop
>>>>>>>> supporting the current leader.
>>>>>>>>
>>>>>>>> -Flavio
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Sergey Maslyakov [mailto:evolvah@gmail.com]
>>>>>>>> Sent: 16 July 2013 16:16
>>>>>>>> To: user@zookeeper.apache.org
>>>>>>>> Subject: Re: Maximum size of a snapshot
>>>>>>>>
>>>>>>>> And another extension on top of Kishore's question: do the
>>>>>>>> reelections happen if the previously elected leader remains in
>>>>>>>> the cluster? In other words, what events can trigger re-election
>>>>>>>> and the corresponding temporary degradation of the service
>>>>>>>> provided by Zookeeper?
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> /Sergey
>>>>>>>>
>>>>>>>> On Tue, Jul 16, 2013 at 2:21 AM, kishore g wrote:
>>>>>>>>
>>>>>>>>> Regarding #2.
>>>>>>>>> Is it really true that during leader election every machine
>>>>>>>>> reloads snapshot data from disk? Any reason why this is needed
>>>>>>>>> unless it really needs to truncate or undo conflicting
>>>>>>>>> transactions already applied?
>>>>>>>>>
>>>>>>>>> On Mon, Jul 15, 2013 at 9:50 PM, Thawan Kooburat wrote:
>>>>>>>>>
>>>>>>>>>> Max snapshot size:
>>>>>>>>>>
>>>>>>>>>> Here is my take on these issues; others feel free to add or
>>>>>>>>>> correct.
>>>>>>>>>>
>>>>>>>>>> 1. Depends on how much RAM your machine has. The snapshot should
>>>>>>>>>> be smaller than the available RAM since everything is loaded
>>>>>>>>>> into memory.
>>>>>>>>>> 2. Depends on what availability guarantee the client needs. If
>>>>>>>>>> there is a leader election, every machine needs to reload the
>>>>>>>>>> data from disk. So the quorum will be down for at least as long
>>>>>>>>>> as the snapshot loading time. The session timeout on the client
>>>>>>>>>> side should be at least longer than the expected downtime
>>>>>>>>>> during leader election.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thawan Kooburat
>>>>>>>>>>
>>>>>>>>>> On 7/15/13 8:46 PM, "Sergey Maslyakov" wrote:
>>>>>>>>>>
>>>>>>>>>>> I have a couple of sizing questions for the users and
>>>>>>>>>>> developers. Hope you don't mind answering those.
>>>>>>>>>>>
>>>>>>>>>>> What is the guideline for the maximum reasonable size of a
>>>>>>>>>>> DataTree that a single ZK server can manage? If a ZK server
>>>>>>>>>>> writes out a snapshot of about 1GB in size, is it pushed
>>>>>>>>>>> beyond the limits or is it still manageable? If so, where is
>>>>>>>>>>> the critical threshold when ZK is really being abused?
>>>>>>>>>>>
>>>>>>>>>>> Similarly, how can I estimate the propagation delay of a
>>>>>>>>>>> change across an ensemble of three ZK servers?
>>>>>>>>>>>
>>>>>>>>>>> Thank you,
>>>>>>>>>>> /Sergey
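
Thawan's remark in the thread above, that some applications use special logic to abort a client that has been disconnected for an extended period, could be sketched roughly as follows. This is only an illustration in Python, not ZooKeeper client code: DisconnectWatchdog, its methods, and the 30-second threshold are hypothetical names and values, and the application is assumed to call on_disconnected()/on_connected() from whatever connection-state callback its client library provides.

```python
import time


class DisconnectWatchdog:
    """Hypothetical helper: tracks how long a client has been disconnected."""

    def __init__(self, max_disconnect_seconds, clock=time.monotonic):
        # max_disconnect_seconds is an application-chosen threshold (assumption).
        self.max_disconnect_seconds = max_disconnect_seconds
        self._clock = clock
        self._disconnected_since = None  # None while connected

    def on_disconnected(self):
        # Keep the time of the first disconnect; repeated events do not reset it.
        if self._disconnected_since is None:
            self._disconnected_since = self._clock()

    def on_connected(self):
        # Reconnecting clears the disconnect timer.
        self._disconnected_since = None

    def should_abort(self):
        # The client cannot tell a quorum outage from a network partition,
        # so the decision is based purely on elapsed disconnected time.
        if self._disconnected_since is None:
            return False
        elapsed = self._clock() - self._disconnected_since
        return elapsed > self.max_disconnect_seconds


if __name__ == "__main__":
    watchdog = DisconnectWatchdog(max_disconnect_seconds=30)
    watchdog.on_disconnected()
    print("abort now?", watchdog.should_abort())
```

The clock is injected so the logic can be exercised without waiting; an application would poll should_abort() periodically and stop acting on possibly stale state once it returns True.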