From: Thawan Kooburat
To: "user@zookeeper.apache.org"
Subject: Re: Maximum size of a snapshot
Date: Tue, 16 Jul 2013 22:45:59 +0000

There is a plan to work on this optimization (ZOOKEEPER-1674).

-- 
Thawan Kooburat

On 7/16/13 1:37 PM, "kishore g" wrote:

>All servers in the quorum reading the snapshot from disk as part of the
>synchronization phase. From Thawan's email it looks like whenever there
>is a leader election, all zk servers read the snapshot from disk. I am
>not sure why all servers should reload the snapshot from disk, as this
>increases unavailability time.
>
>
>On Tue, Jul 16, 2013 at 12:35 PM, Flavio Junqueira wrote:
>
>> The synchronization phase is part of the protocol and we use it to
>> guarantee that we expose a consistent view of the state. During the
>> synchronization phase, servers do not accept requests.
>>
>> Which behavior are you proposing we change, Kishore?
>>
>> -Flavio
>>
>> On Jul 16, 2013, at 7:04 PM, kishore g wrote:
>>
>> > Thanks for the clarification, Flavio. Does this mean that during
>> > leader election, both reads and writes are not supported? Do we
>> > start a separate thread/jira for changing this behavior?
>> >
>> > thanks,
>> > Kishore G
>> >
>> >
>> > On Tue, Jul 16, 2013 at 9:16 AM, Flavio Junqueira wrote:
>> >
>> >> The disk state should be the authoritative state of a server, so if
>> >> I remember correctly, we load the database as a way of validating
>> >> the disk state. I don't claim that this is strictly necessary, but
>> >> if we are to change it, then I would need to think this through.
>> >>
>> >> About leader election: if a leader loses support from a quorum of
>> >> followers, then it will drop leadership. Any event that causes a
>> >> follower to stop receiving messages from the leader, or the
>> >> follower to disconnect from the leader, will make it stop
>> >> supporting the current leader.
>> >>
>> >> -Flavio
>> >>
>> >> -----Original Message-----
>> >> From: Sergey Maslyakov [mailto:evolvah@gmail.com]
>> >> Sent: 16 July 2013 16:16
>> >> To: user@zookeeper.apache.org
>> >> Subject: Re: Maximum size of a snapshot
>> >>
>> >> And another extension on top of Kishore's question: do re-elections
>> >> happen if the previously elected leader remains in the cluster? In
>> >> other words, what events can trigger re-election and the
>> >> corresponding temporary degradation of the service provided by
>> >> ZooKeeper?
>> >>
>> >>
>> >> Thank you,
>> >> /Sergey
>> >>
>> >>
>> >> On Tue, Jul 16, 2013 at 2:21 AM, kishore g wrote:
>> >>
>> >>> Regarding #2: is it really true that during leader election every
>> >>> machine reloads snapshot data from disk? Any reason why this is
>> >>> needed unless it really needs to truncate or undo conflicting
>> >>> transactions already applied?
>> >>>
>> >>>
>> >>> On Mon, Jul 15, 2013 at 9:50 PM, Thawan Kooburat wrote:
>> >>>
>> >>>> Max snapshot size:
>> >>>>
>> >>>> Here is my take on these issues; others feel free to add or
>> >>>> correct.
>> >>>>
>> >>>> 1. Depends on how much RAM your machine has. The snapshot should
>> >>>> be smaller than the available RAM, since everything is loaded
>> >>>> into memory.
>> >>>> 2. Depends on what availability guarantee the client needs. If
>> >>>> there is a leader election, every machine needs to reload the
>> >>>> data from disk, so the quorum will be down for at least as long
>> >>>> as the snapshot loading time. The session timeout on the client
>> >>>> side should be longer than the expected downtime during leader
>> >>>> election.
>> >>>>
>> >>>> --
>> >>>> Thawan Kooburat
>> >>>>
>> >>>> On 7/15/13 8:46 PM, "Sergey Maslyakov" wrote:
>> >>>>
>> >>>>> I have a couple of sizing questions for the users and
>> >>>>> developers. I hope you don't mind answering them.
>> >>>>>
>> >>>>> What is the guideline for the maximum reasonable size of a
>> >>>>> DataTree that a single ZK server can manage? If a ZK server
>> >>>>> writes out a snapshot of about 1GB in size, is it pushed beyond
>> >>>>> the limits or is it still manageable? If so, where is the
>> >>>>> critical threshold when ZK is really being abused?
>> >>>>>
>> >>>>> Similarly, how can I estimate the propagation delay of a change
>> >>>>> across an ensemble of three ZK servers?
>> >>>>>
>> >>>>>
>> >>>>> Thank you,
>> >>>>> /Sergey
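
Thawan's second point in the quoted thread (a client session timeout longer than the expected election downtime, which is at least the snapshot loading time) can be turned into rough arithmetic. The sketch below is only an illustration under stated assumptions: the function names, the disk throughput figure, the election overhead and the safety factor are all hypothetical and not taken from the thread or from ZooKeeper itself. The disk read time is a lower bound on loading time, since deserialization also takes time. The one ZooKeeper-specific fact used is that the server caps session timeouts at maxSessionTimeout, which defaults to 20 * tickTime.

```python
import math


def estimated_load_seconds(snapshot_bytes, read_bytes_per_sec):
    """Lower bound on the time needed to read a snapshot from disk."""
    if read_bytes_per_sec <= 0:
        raise ValueError("read_bytes_per_sec must be positive")
    return snapshot_bytes / read_bytes_per_sec


def min_session_timeout_ms(snapshot_bytes, read_bytes_per_sec,
                           election_overhead_sec=0.0, safety_factor=2.0):
    """Rough client session timeout (ms) that outlasts a leader election.

    election_overhead_sec and safety_factor are illustrative knobs
    (assumptions), not values prescribed by ZooKeeper.
    """
    load = estimated_load_seconds(snapshot_bytes, read_bytes_per_sec)
    return math.ceil((load + election_overhead_sec) * safety_factor * 1000)


def fits_server_limit(timeout_ms, tick_time_ms=2000,
                      max_session_timeout_ms=None):
    """Check the timeout against the server-side cap.

    If maxSessionTimeout is not configured, the server default of
    20 * tickTime applies.
    """
    if max_session_timeout_ms is None:
        limit = 20 * tick_time_ms
    else:
        limit = max_session_timeout_ms
    return timeout_ms <= limit


if __name__ == "__main__":
    # Hypothetical numbers: ~1 GB snapshot, disk reading ~100 MB/s.
    timeout = min_session_timeout_ms(1_000_000_000, 100_000_000,
                                     election_overhead_sec=2.0)
    print("suggested session timeout (ms):", timeout)
    print("within default server cap:", fits_server_limit(timeout))
```

If the suggested timeout exceeds the server cap, the server will negotiate it down, so either maxSessionTimeout has to be raised or the snapshot kept smaller.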