From: Camille Fournier
Date: Wed, 8 Aug 2018 14:32:36 +0000
Subject: Re: Leader election failing
To: "user@zookeeper.apache.org"

Hard to say.
It looks like about 15 minutes after your first incident, where 5 goes down and then comes back up, servers 1 and 2 get socket errors on their connections with 3, 4, and 6. It's possible that if you had waited those 15 minutes, the quorum would have formed with the other servers once those errors cleared. But as for why there were those errors in the first place, it's not clear. Could be a network glitch, or an obscure bug in the connection logic.

Has anyone else ever seen this? If you see it again, getting a stack trace of the servers when they can't form quorum might be helpful.

On Wed, Aug 8, 2018 at 11:52 AM Cee Tee wrote:

> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
> 1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
> Yesterday one of the participants (id 5, which by chance was the leader) was
> rebooted. Although all other servers were online and not suffering from
> networking issues, the leader election failed and the cluster remained
> "looking" until the old leader came back online, after which it was promptly
> elected as leader again.
>
> Today we tried the same exercise on the exact same servers: 5 was still
> leader and was rebooted, and leader election worked fine with 4 as new
> leader.
>
> I have included the logs. From the logs I see that yesterday 1,2 never
> received new leader proposals from 3,4 and vice versa.
> Today all proposals came through. This is not the first time we've seen
> this type of behavior, where some zookeepers can't seem to find each other
> after the leader goes down.
> All servers use dynamic configuration and have the same config node.
>
> How could this be explained? These servers also host a replicated database
> cluster and have no history of db replication issues.
>
> Thanks,
> Chris
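
For readers following the thread, the quorum arithmetic behind the reported symptom can be sketched as below. This is a minimal illustration, not ZooKeeper code. It assumes the default majority quorum (no hierarchical groups or weights), and the names `PARTICIPANTS`, `majority`, and `can_elect` are hypothetical helpers introduced here. With server 5 down, servers 1,2 and servers 3,4 each hold only two votes, so if the two pairs cannot exchange election messages, neither side reaches the three votes needed and the ensemble stays in "looking".

```python
# Voting members from the thread. Server 6 is an observer and does not vote.
PARTICIPANTS = {1, 2, 3, 4, 5}


def majority(n):
    """Smallest number of votes that is a strict majority of n voters."""
    return n // 2 + 1


def can_elect(reachable, participants=PARTICIPANTS):
    """True if the mutually reachable servers include a majority of participants.

    Assumption: plain majority quorum, which is the default when no
    hierarchical quorum groups or weights are configured.
    """
    voters = set(reachable) & participants  # observers are filtered out here
    return len(voters) >= majority(len(participants))


# Server 5 is rebooting. Datacenter A keeps 1,2; datacenter B keeps 3,4 and observer 6.
dc_a = {1, 2}
dc_b = {3, 4, 6}

print(majority(len(PARTICIPANTS)))  # 3
print(can_elect(dc_a))              # False: only 2 of 5 votes
print(can_elect(dc_b))              # False: the observer adds no vote
print(can_elect(dc_a | dc_b))       # True: 4 of 5 votes once both sides connect
```

The last line corresponds to the second day's test, when proposals flowed between the datacenters and server 4 was elected.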