From user-return-11627-archive-asf-public=cust-asf.ponee.io@zookeeper.apache.org  Tue Aug 21 11:29:55 2018
Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:user-help@zookeeper.apache.org>
List-Unsubscribe: <mailto:user-unsubscribe@zookeeper.apache.org>
List-Post: <mailto:user@zookeeper.apache.org>
List-Id: <user.zookeeper.apache.org>
Reply-To: user@zookeeper.apache.org
Delivered-To: mailing list user@zookeeper.apache.org
MIME-Version: 1.0
References: <CAJwtRbxg7V06gk1PeQY7wVVJydP4D3avL+PMF2-b2_mbcAzV_A@mail.gmail.com>
 <CABWqe2b+qT_Z+1FP79HHQJVJdDz4WvDe8oFpbX+Ta5s=DrX=JQ@mail.gmail.com>
 <1651a05f250.276d.495a588ebf64bb63541fbe4ec3b29808@gmail.com>
 <CAFBpvfto33-WVzK_p_sS6tYzV-y-5O4THB9wZx1jnKXWwLf2ew@mail.gmail.com>
 <1651a161ef0.276d.495a588ebf64bb63541fbe4ec3b29808@gmail.com>
 <CABWqe2ZrevBYhB8jMFx6qSEQrXkv49Xmp65pN+kO0g43O-muRQ@mail.gmail.com> <CAFBpvfs2e_-7AF+SvL=-nMt28KPABdByqOBpE-LW-p2D5OhUJg@mail.gmail.com>
In-Reply-To: <CAFBpvfs2e_-7AF+SvL=-nMt28KPABdByqOBpE-LW-p2D5OhUJg@mail.gmail.com>
From: Cee Tee <c.turksema@gmail.com>
Date: Tue, 21 Aug 2018 11:29:30 +0200
Message-ID: <CAJwtRbyZUD+3oGLp9G1QpZzGCNwJbRe5=eqjTxs1pKSE9sX39Q@mail.gmail.com>
Subject: Re: Leader election failing
To: user@zookeeper.apache.org
Content-Type: multipart/alternative; boundary="0000000000006ca8a50573eeaa5d"

--0000000000006ca8a50573eeaa5d
Content-Type: text/plain; charset="UTF-8"

I tested the patch and let it run for 6 days. It did not help; the result is
still the same (the remaining ZKs form islands based on the datacenter they
are in).
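
To spell out why the islands cannot recover on their own, here is a minimal
sketch of the quorum arithmetic for the layout described further down in this
thread. It assumes plain majority quorums and that the observer (id 6) does
not vote; the function names are only illustrative and are not ZooKeeper APIs.

```python
# Sketch only: majority-quorum arithmetic for the cluster in this thread.
# Assumption: plain majority quorums; the observer (id 6) has no vote.

VOTERS = {1: "A", 2: "A", 5: "A", 3: "B", 4: "B"}  # participant id -> datacenter


def quorum_size(num_voters):
    """Smallest strict majority of the voting members."""
    return num_voters // 2 + 1


def can_elect(reachable_group):
    """True if a group of mutually reachable servers holds a voting majority."""
    voters_in_group = VOTERS.keys() & set(reachable_group)
    return len(voters_in_group) >= quorum_size(len(VOTERS))


def islands(down):
    """Group the voters that are still up by datacenter."""
    groups = {}
    for sid, dc in VOTERS.items():
        if sid not in down:
            groups.setdefault(dc, set()).add(sid)
    return groups


if __name__ == "__main__":
    # Leader 5 is down; the survivors only see their own datacenter.
    for dc, members in sorted(islands(down={5}).items()):
        print(dc, sorted(members), "can elect:", can_elect(members))
    # The same four survivors, if they could all reach each other.
    print("all survivors can elect:", can_elect({1, 2, 3, 4}))
```

With 5 voters the majority is 3, so each island of 2 stays in "looking",
while the 4 survivors together would have been enough.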

I have mitigated it by doing a daily rolling restart.
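
For spotting the stuck state before a restart, a rough sketch of polling each
server with the "srvr" four-letter command and reading its "Mode:" line. It
assumes "srvr" is allowed by 4lw.commands.whitelist and that the client port
is 2181; the helper names are made up for this example.

```python
# Sketch only: ask a ZooKeeper server for its role via the "srvr" command.
# Assumptions: "srvr" is whitelisted and the client port is 2181.
import socket


def four_letter_word(host, port=2181, cmd="srvr", timeout=5.0):
    """Send a four-letter command and return the full text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_output):
    """Return the value of the 'Mode:' line, or None if there is none."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None
```

Calling parse_mode(four_letter_word(host)) for every server gives leader,
follower or observer; a None result means the reply had no "Mode:" line,
which is what I would expect from a server that is not serving requests.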

Regards,
Chris

On Mon, Aug 13, 2018 at 2:06 PM Andor Molnar <andor@cloudera.com.invalid>
wrote:

> Hi Chris,
>
> Would you mind testing the following patch on your test clusters?
> I'm not entirely sure, but the issue might be related.
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-2930
>
> Regards,
> Andor
>
>
>
> On Wed, Aug 8, 2018 at 6:51 PM, Camille Fournier <camille@apache.org>
> wrote:
>
> > If you have the time and inclination, next time you see this problem in
> > your test clusters, get stack traces and any other diagnostics possible
> > before restarting. I'm not an expert at network debugging, but if you have
> > someone who is, you might want them to look at the connections and
> > settings of any switches/firewalls/etc. involved, and see if there are any
> > unusual configurations or evidence of other long-lived connections failing
> > (even if their services handle the failures more gracefully). Send us the
> > stack traces as well; it would be interesting to take a look.
> >
> > C
> >
> >
> > On Wed, Aug 8, 2018, 11:09 AM Chris <c.turksema@gmail.com> wrote:
> >
> > > Running 3.5.5
> > >
> > > I managed to recreate it on the acceptance and test clusters today,
> > > failing on shutdown of the leader. Both had been running for over a
> > > week. After restarting all zookeepers it runs fine no matter how many
> > > leader shutdowns I throw at it.
> > >
> > > On 8 August 2018 5:05:34 pm Andor Molnar <andor@cloudera.com.INVALID>
> > > wrote:
> > >
> > > > Some kind of a network split?
> > > >
> > > > It looks like 1-2 and 3-4 were able to communicate with each other,
> > > > but the connection timed out between these 2 splits. When 5 came back
> > > > online it started with supporters of (1,2) and later 3 and 4 also
> > > > joined.
> > > >
> > > > There was no such issue the day after.
> > > >
> > > > Which version of ZooKeeper is this? 3.5.something?
> > > >
> > > > Regards,
> > > > Andor
> > > >
> > > >
> > > >
> > > > On Wed, Aug 8, 2018 at 4:52 PM, Chris <c.turksema@gmail.com> wrote:
> > > >
> > > >> Actually I have similar issues on my test and acceptance clusters,
> > > >> where leader election fails if the cluster has been running for a
> > > >> couple of days.
> > > >> If you stop/start the Zookeepers once they will work fine on further
> > > >> disruptions that day. Not sure yet what the threshold is.
> > > >>
> > > >>
> > > >> On 8 August 2018 4:32:56 pm Camille Fournier <camille@apache.org>
> > > wrote:
> > > >>
> > > >>> Hard to say. It looks like about 15 minutes after your first
> > > >>> incident, where 5 goes down and then comes back up, servers 1 and 2
> > > >>> get socket errors on their connections with 3, 4, and 6. It's
> > > >>> possible that if you had waited those 15 minutes, once those errors
> > > >>> cleared the quorum would've formed with the other servers. But as
> > > >>> for why there were those errors in the first place, it's not clear.
> > > >>> Could be a network glitch, or an obscure bug in the connection
> > > >>> logic. Has anyone else ever seen this?
> > > >>> If you see it again, getting a stack trace of the servers when they
> > > >>> can't form quorum might be helpful.
> > > >>>
> > > >>> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turksema@gmail.com>
> > wrote:
> > > >>>
> > > >>>> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
> > > >>>> 1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
> > > >>>> Yesterday one of the participants (id 5, which happened to be the
> > > >>>> leader) was rebooted. Although all other servers were online and
> > > >>>> not suffering from networking issues, the leader election failed
> > > >>>> and the cluster remained "looking" until the old leader came back
> > > >>>> online, after which it was promptly elected as leader again.
> > > >>>>
> > > >>>> Today we tried the same exercise on the exact same servers: 5 was
> > > >>>> still leader and was rebooted, and leader election worked fine
> > > >>>> with 4 as the new leader.
> > > >>>>
> > > >>>> I have included the logs. From the logs I see that yesterday 1,2
> > > >>>> never received new leader proposals from 3,4 and vice versa.
> > > >>>> Today all proposals came through. This is not the first time we've
> > > >>>> seen this type of behavior, where some zookeepers can't seem to
> > > >>>> find each other after the leader goes down.
> > > >>>> All servers use dynamic configuration and have the same config
> > > >>>> node.
> > > >>>>
> > > >>>> How could this be explained? These servers also host a replicated
> > > >>>> database cluster and have no history of db replication issues.
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Chris

--0000000000006ca8a50573eeaa5d--