From: "snair123 ."
To: "user@zookeeper.apache.org"
Subject: RE: Tracking down possible network partition
Date: Fri, 10 Jul 2015 18:21:20 +0000

2.) It appears that the leader closes connections to the affected followers after a “transaction timeout” occurs. Where would I find out what this timeout is? Is this the same thing as a session timeout (e.g. the default of 20 * tickTime)?

https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L496

a. So the leader closes connections to followers and observers after syncLimit*tickTime milliseconds?
b. So what purpose does syncLimit serve in followers and in observers?
c. If I needed the observer to stay connected to the ZK ensemble for a longer time, in case of network partitions, do I increase syncLimit at the leader or at the observer?

> Date: Fri, 26 Jun 2015 18:10:45 -0700
> Subject: Re: Tracking down possible network partition
> From: rgs@itevenworks.net
> To: user@zookeeper.apache.org
>
> On 25 June 2015 at 07:28, Round, Mark wrote:
>
> > I have a 5-node Zookeeper 3.4.6 cluster across 3 data centres (2
> > zookeepers in each “main” DC, and a 5th in a 3rd DC for quorum). I see that
> > the two nodes in one DC have regular “issues” where they get kicked out of
> > the cluster and the ZooKeeperServer process stops for a few minutes until
> > the node rejoins.
> > I’d like to know a couple of things, if someone could
> > please point me in the direction of the relevant docs I’d greatly
> > appreciate it.
> >
> > 1.) Is it expected behaviour that when a node is kicked from the cluster,
> > it will not be allowed to re-join for a period? From the logs below I can
> > see that re-establishing a valid cluster took around 15 minutes.
> >
>
> I don't think so.
>
> > 2.) It appears that the leader closes connections to the affected followers
> > after a “transaction timeout” occurs. Where would I find out what this
> > timeout is? Is this the same thing as a session timeout (e.g. the default
> > of 20 * tickTime)?
> >
>
> https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L496
>
> > 3.) Where can I find the definition of the different fields in the
> > election log messages (i.e. what are “n.round”, “n.zxid”, “n.state” and so
> > on)?
>
> Not sure if there's a better source than the source:
> https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L687
>
> -rgs
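
To make the arithmetic behind questions (a) to (c) concrete, here is a minimal sketch. It assumes, as question (a) supposes and as the linked LearnerHandler code appears to suggest, that the leader's limit for a connected learner is syncLimit * tickTime milliseconds; the thread itself does not confirm this. The configuration values are illustrative, and `parse_zoo_cfg` and `learner_timeouts_ms` are hypothetical helper names, not part of ZooKeeper.

```python
# Sketch only: helper names and config values are illustrative assumptions.

def parse_zoo_cfg(text):
    """Return the integer-valued key=value settings found in zoo.cfg-style text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if value.isdigit():
            settings[key] = int(value)
    return settings


def learner_timeouts_ms(settings):
    """Derive quorum and session time limits (in ms) from tick-based settings."""
    tick = settings["tickTime"]
    return {
        # Assumed: time allowed for a learner to connect and sync initially.
        "init_timeout_ms": settings["initLimit"] * tick,
        # Assumed: limit applied to an already-synced learner (question a).
        "sync_timeout_ms": settings["syncLimit"] * tick,
        # Default upper bound on client session timeouts (20 * tickTime).
        "max_session_timeout_ms": 20 * tick,
    }


EXAMPLE_CFG = """
# illustrative values, not taken from the thread
tickTime=2000
initLimit=10
syncLimit=5
"""

timeouts = learner_timeouts_ms(parse_zoo_cfg(EXAMPLE_CFG))
print(timeouts)
```

With these illustrative values the assumed quorum-side limit is 10 seconds, while the default ceiling for client sessions is 40 seconds, so under the stated assumption the two are separate settings rather than the same timeout.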