From user-return-11603-archive-asf-public=cust-asf.ponee.io@zookeeper.apache.org  Wed Aug  8 16:32:52 2018
Return-Path: <user-return-11603-archive-asf-public=cust-asf.ponee.io@zookeeper.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by mx-eu-01.ponee.io (Postfix) with SMTP id DD82E180600
	for <archive-asf-public@cust-asf.ponee.io>; Wed,  8 Aug 2018 16:32:51 +0200 (CEST)
Received: (qmail 3894 invoked by uid 500); 8 Aug 2018 14:32:50 -0000
Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:user-help@zookeeper.apache.org>
List-Unsubscribe: <mailto:user-unsubscribe@zookeeper.apache.org>
List-Post: <mailto:user@zookeeper.apache.org>
List-Id: <user.zookeeper.apache.org>
Reply-To: user@zookeeper.apache.org
Delivered-To: mailing list user@zookeeper.apache.org
Received: (qmail 3883 invoked by uid 99); 8 Aug 2018 14:32:50 -0000
Received: from mail-relay.apache.org (HELO mailrelay2-lw-us.apache.org) (207.244.88.137)
    by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 08 Aug 2018 14:32:50 +0000
Received: from mail-lf1-f48.google.com (mail-lf1-f48.google.com [209.85.167.48])
	by mailrelay2-lw-us.apache.org (ASF Mail Server at mailrelay2-lw-us.apache.org) with ESMTPSA id A2A9D1F78
	for <user@zookeeper.apache.org>; Wed,  8 Aug 2018 14:32:49 +0000 (UTC)
Received: by mail-lf1-f48.google.com with SMTP id g6-v6so1726613lfb.11
        for <user@zookeeper.apache.org>; Wed, 08 Aug 2018 07:32:49 -0700 (PDT)
X-Gm-Message-State: AOUpUlHlkzK4TaDKDcwZRmvr1MVFCKhhZdElUFkIT+5dNS+MVY2vVXqi
	aIOXFNoYfxcB5vOYjCwtNlW6zaPBoorVV1V/pCc=
X-Google-Smtp-Source: AA+uWPzAgayCD8i8DOQNsrJpMHXMdRxaUFHA++RR/MX2qcvkua8TqbsVOFbTw63PZqdC5kQbCxjWmRarq2rQhWQsP9c=
X-Received: by 2002:a19:501e:: with SMTP id e30-v6mr2075377lfb.71.1533738768500;
 Wed, 08 Aug 2018 07:32:48 -0700 (PDT)
MIME-Version: 1.0
References: <CAJwtRbxg7V06gk1PeQY7wVVJydP4D3avL+PMF2-b2_mbcAzV_A@mail.gmail.com>
In-Reply-To: <CAJwtRbxg7V06gk1PeQY7wVVJydP4D3avL+PMF2-b2_mbcAzV_A@mail.gmail.com>
From: Camille Fournier <camille@apache.org>
Date: Wed, 8 Aug 2018 14:32:36 +0000
X-Gmail-Original-Message-ID: <CABWqe2b+qT_Z+1FP79HHQJVJdDz4WvDe8oFpbX+Ta5s=DrX=JQ@mail.gmail.com>
Message-ID: <CABWqe2b+qT_Z+1FP79HHQJVJdDz4WvDe8oFpbX+Ta5s=DrX=JQ@mail.gmail.com>
Subject: Re: Leader election failing
To: "user@zookeeper.apache.org" <user@zookeeper.apache.org>
Content-Type: multipart/alternative; boundary="00000000000079919f0572ed62de"

--00000000000079919f0572ed62de
Content-Type: text/plain; charset="UTF-8"

Hard to say. It looks like about 15 minutes after your first incident, where
5 goes down and then comes back up, servers 1 and 2 get socket errors on
their connections to 3, 4, and 6. It's possible that, had you waited out those
15 minutes, a quorum would have formed with the other servers once the errors
cleared. But as for why those errors occurred in the first place, it's not
clear. It could be a network glitch or an obscure bug in the connection
logic. Has anyone else ever seen this?
If you see it again, getting a stack trace (thread dump) from each server
while they can't form a quorum might be helpful.
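
For reference, here is a minimal sketch of the quorum arithmetic behind
this, assuming the default majority-quorum rule over your 5 voting
participants and assuming that 1,2 really could not reach 3,4 while 5 was
down. The names VOTERS and has_quorum are hypothetical and this is not
ZooKeeper's actual election code.

```python
# Illustrative only: majority-quorum arithmetic, not ZooKeeper's election code.
VOTERS = {1, 2, 3, 4, 5}  # participants; observer 6 does not vote


def has_quorum(reachable, voters=VOTERS):
    """Return True if the voters in `reachable` are a strict majority of `voters`."""
    return len(set(reachable) & set(voters)) > len(voters) // 2


# Server 5 is rebooting and the two datacenters cannot see each other.
side_a = {1, 2}      # datacenter A without 5
side_b = {3, 4, 6}   # datacenter B; 6 is an observer and is not counted

print(has_quorum(side_a))          # 2 of 5 voters
print(has_quorum(side_b))          # 2 of 5 voters (6 ignored)
print(has_quorum(side_a | {5}))    # 5 back in datacenter A: 3 of 5 voters
```

Under those assumptions neither side can elect a leader on its own, but
as soon as 5 rejoins 1 and 2 there are three voters again, which would
match 5 being promptly re-elected.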

On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turksema@gmail.com> wrote:

> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
> 1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
> Yesterday one of the participants (id 5, which happened to be the leader)
> was rebooted. Although all other servers were online and not suffering from
> networking issues, the leader election failed and the cluster remained
> "looking" until the old leader came back online, after which it was promptly
> elected as leader again.
>
> Today we tried the same exercise on the exact same servers, 5 was still
> leader and was rebooted, and leader election worked fine with 4 as new
> leader.
>
> I have included the logs. From the logs I see that yesterday 1,2 never
> received new leader proposals from 3,4 and vice versa.
> Today all proposals came through. This is not the first time we've seen
> this type of behavior, where some zookeepers can't seem to find each other
> after the leader goes down.
> All servers use dynamic configuration and have the same config node.
>
> How could this be explained? These servers also host a replicated database
> cluster and have no history of db replication issues.
>
> Thanks,
> Chris
>
>
>

--00000000000079919f0572ed62de--