From: Cee Tee
Date: Tue, 11 Sep 2018 11:08:51 +0200
Subject: Re: Leader election failing
To: user@zookeeper.apache.org

Concluded a test with a 3.4.13 cluster; it shows the same behaviour.

On Mon, Sep 3, 2018 at 4:56 PM Andor Molnar wrote:

> Thanks for testing, Chris.
>
> So, if I understand you correctly, you're running the latest version from
> branch-3.5. Could we say that this is a 3.5-only problem?
> Have you ever tested the same cluster with 3.4?
>
> Regards,
> Andor
>
> On Tue, Aug 21, 2018 at 11:29 AM, Cee Tee wrote:
>
> > I've tested the patch and let it run for 6 days. It did not help; the
> > result is still the same (the remaining ZooKeepers form islands based
> > on the datacenter they are in).
> >
> > I have mitigated it by doing a daily rolling restart.
> >
> > Regards,
> > Chris
> >
> > On Mon, Aug 13, 2018 at 2:06 PM Andor Molnar wrote:
> >
> > > Hi Chris,
> > >
> > > Would you mind testing the following patch on your test clusters?
> > > I'm not entirely sure, but the issue might be related.
> > >
> > > https://issues.apache.org/jira/browse/ZOOKEEPER-2930
> > >
> > > Regards,
> > > Andor
> > >
> > > On Wed, Aug 8, 2018 at 6:51 PM, Camille Fournier wrote:
> > >
> > > > If you have the time and inclination, next time you see this
> > > > problem in your test clusters, get stack traces and any other
> > > > diagnostics possible before restarting. I'm not an expert at
> > > > network debugging, but if you have someone who is, you might want
> > > > them to take a look at the connections and settings of any
> > > > switches/firewalls/etc. involved, and see if there are any unusual
> > > > configurations or evidence of other long-lived connections failing
> > > > (even if their services handle the failures more gracefully). Send
> > > > us the stack traces as well; it would be interesting to take a look.
> > > >
> > > > C
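[A minimal sketch of how those stack traces could be captured before
restarting a stuck server. The output path is hypothetical; it assumes the
JDK's jstack is on the PATH and that the server JVM was started via the
usual org.apache.zookeeper.server.quorum.QuorumPeerMain main class.]

    # Take a thread dump of the running ZooKeeper server before restarting it.
    # pgrep -f matches against the full command line, so it finds the JVM
    # running QuorumPeerMain; jstack prints the stack of every thread.
    jstack $(pgrep -f QuorumPeerMain) > /tmp/zk-stacks-$(hostname)-$(date +%s).txt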
> > > > On Wed, Aug 8, 2018, 11:09 AM Chris wrote:
> > > >
> > > > > Running 3.5.5.
> > > > >
> > > > > I managed to recreate it on the acceptance and test clusters
> > > > > today, failing on shutdown of the leader. Both had been running
> > > > > for over a week. After restarting all ZooKeepers it runs fine, no
> > > > > matter how many leader shutdowns I throw at it.
> > > > >
> > > > > On 8 August 2018 5:05:34 PM, Andor Molnar wrote:
> > > > >
> > > > > > Some kind of network split?
> > > > > >
> > > > > > It looks like 1-2 and 3-4 were able to communicate with each
> > > > > > other, but connections timed out between these two islands.
> > > > > > When 5 came back online, it started with supporters (1,2), and
> > > > > > later 3 and 4 also joined.
> > > > > >
> > > > > > There was no such issue the day after.
> > > > > >
> > > > > > Which version of ZooKeeper is this? 3.5.something?
> > > > > >
> > > > > > Regards,
> > > > > > Andor
> > > > > >
> > > > > > On Wed, Aug 8, 2018 at 4:52 PM, Chris wrote:
> > > > > >
> > > > > > > Actually, I have similar issues on my test and acceptance
> > > > > > > clusters, where leader election fails if the cluster has been
> > > > > > > running for a couple of days. If you stop/start the
> > > > > > > ZooKeepers once, they will work fine on further disruptions
> > > > > > > that day. Not sure yet what the threshold is.
> > > > > > >
> > > > > > > On 8 August 2018 4:32:56 PM, Camille Fournier wrote:
> > > > > > >
> > > > > > > > Hard to say. It looks like about 15 minutes after your
> > > > > > > > first incident, where 5 goes down and then comes back up,
> > > > > > > > servers 1 and 2 get socket errors on their connections with
> > > > > > > > 3, 4, and 6. It's possible that if you had waited those 15
> > > > > > > > minutes, once those errors cleared the quorum would have
> > > > > > > > formed with the other servers. But as for why those errors
> > > > > > > > occurred in the first place, it's not clear. Could be a
> > > > > > > > network glitch, or an obscure bug in the connection logic.
> > > > > > > > Has anyone else ever seen this?
> > > > > > > > If you see it again, getting a stack trace of the servers
> > > > > > > > when they can't form quorum might be helpful.
> > > > > > > >
> > > > > > > > On Wed, Aug 8, 2018 at 11:52 AM Cee Tee wrote:
> > > > > > > >
> > > > > > > > > I have a cluster of 5 participants (id 1-5) and 1
> > > > > > > > > observer (id 6). 1, 2 and 5 are in datacenter A; 3, 4 and
> > > > > > > > > 6 are in datacenter B.
> > > > > > > > > Yesterday one of the participants (id 5, which by chance
> > > > > > > > > was the leader) was rebooted. Although all other servers
> > > > > > > > > were online and not suffering from networking issues, the
> > > > > > > > > leader election failed and the cluster remained "looking"
> > > > > > > > > until the old leader came back online, after which it was
> > > > > > > > > promptly elected leader again.
> > > > > > > > >
> > > > > > > > > Today we tried the same exercise on the exact same
> > > > > > > > > servers: 5 was still the leader and was rebooted, and
> > > > > > > > > leader election worked fine, with 4 as the new leader.
> > > > > > > > >
> > > > > > > > > I have included the logs. From the logs I see that
> > > > > > > > > yesterday 1 and 2 never received new leader proposals
> > > > > > > > > from 3 and 4, and vice versa. Today all proposals came
> > > > > > > > > through. This is not the first time we've seen this type
> > > > > > > > > of behaviour, where some ZooKeepers can't seem to find
> > > > > > > > > each other after the leader goes down.
> > > > > > > > > All servers use dynamic configuration and have the same
> > > > > > > > > config node.
> > > > > > > > >
> > > > > > > > > How could this be explained? These servers also host a
> > > > > > > > > replicated database cluster and have no history of
> > > > > > > > > database replication issues.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Chris
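[For context, a minimal sketch of the kind of dynamic configuration file
the thread describes: five participants and one observer spread over two
datacenters. Hostnames, ports, and the exact layout are illustrative
placeholders, not the poster's actual values.]

    # zoo.cfg.dynamic (hypothetical): servers 1, 2, 5 in datacenter A;
    # servers 3, 4, 6 in datacenter B; server 6 joins as an observer.
    # Format: server.<id>=<host>:<quorum port>:<election port>:<role>;<client port>
    server.1=zk1.dc-a.example.com:2888:3888:participant;2181
    server.2=zk2.dc-a.example.com:2888:3888:participant;2181
    server.3=zk3.dc-b.example.com:2888:3888:participant;2181
    server.4=zk4.dc-b.example.com:2888:3888:participant;2181
    server.5=zk5.dc-a.example.com:2888:3888:participant;2181
    server.6=zk6.dc-b.example.com:2888:3888:observer;2181

With dynamic reconfiguration enabled, all servers share this membership
view through the /zookeeper/config znode, which matches the poster's note
that every server has the same config node.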