Date: Fri, 26 Jun 2015 18:10:45 -0700
Subject: Re: Tracking down possible network partition
From: Raúl Gutiérrez Segalés
To: "user@zookeeper.apache.org"

On 25 June 2015 at 07:28, Round, Mark wrote:

> I have a 5-node Zookeeper 3.4.6 cluster across 3 data centres (2
> zookeepers in each “main” DC, and a 5th in a 3rd DC for quorum). I see that
> the two nodes in one DC have regular “issues” where they get kicked out of
> the cluster and the ZooKeeperServer process stops for a few minutes until
> the node rejoins. I’d like to know a couple of things, if someone could
> please point me in the direction of the relevant docs I’d greatly
> appreciate it.
>
> 1.) Is it expected behaviour that when a node is kicked from the cluster,
> it will not be allowed to re-join for a period ? From the logs below I can
> see that re-establishing a valid cluster took around 15 minutes.
>

I don't think so.

> 2.) It appears that the leader closes connections to the affected followers
> after a “transaction timeout” occurs. Where would I find out what this
> timeout is ? Is this the same thing as a session timeout (e.g. The default
> of 20 * tickTime) ?
>

https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L496

> 3.) Where can I find the definition of the different fields in the
> election log messages (I.e. What are “n.round”, “n.zxid”, “n.state” and so
> on) ?

Not sure if there's a better source than the source:

https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L687

-rgs