From: Chris
Reply-To: user@zookeeper.apache.org
Date: Wed, 08 Aug 2018 16:52:03 +0200
Message-ID: <1651a05f250.276d.495a588ebf64bb63541fbe4ec3b29808@gmail.com>
Subject: Re: Leader election failing

Actually, I have similar issues on my test and acceptance clusters: leader
election fails if the cluster has been running for a couple of days. If you
stop and start the ZooKeeper servers once, they work fine on further
disruptions that day. I'm not sure yet what the threshold is.

On 8 August 2018 4:32:56 pm Camille Fournier wrote:

> Hard to say. It looks like about 15 minutes after your first incident where
> 5 goes down and then comes back up, servers 1 and 2 get socket errors to
> their connections with 3, 4, and 6. It's possible if you had waited those
> 15 minutes, once those errors cleared the quorum would've formed with the
> other servers. But as for why there were those errors in the first place
> it's not clear. Could be a network glitch, or an obscure bug in the
> connection logic. Has anyone else ever seen this?
> If you see it again, getting a stack trace of the servers when they can't
> form quorum might be helpful.
>
> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee wrote:
>
>> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
>> 1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
>> Yesterday one of the participants (id5, by chance was the leader) was
>> rebooted. Although all other servers were online and not suffering from
>> networking issues the leader election failed and the cluster remained
>> "looking" until the old leader came back online after which it was promptly
>> elected as leader again.
>>
>> Today we tried the same exercise on the exact same servers, 5 was still
>> leader and was rebooted, and leader election worked fine with 4 as new
>> leader.
>>
>> I have included the logs. From the logs i see that yesterday 1,2 never
>> received new leader proposals from 3,4 and vice versa.
>> Today all proposals came through. This is not the first time we've seen
>> this type of behavior, where some zookeepers can't seem to find each other
>> after the leader goes down.
>> All servers use dynamic configuration and have the same config node.
>>
>> How could this be explained? These servers also host a replicated database
>> cluster and have no history of db replication issues.
>>
>> Thanks,
>> Chris
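
For reference, the vote arithmetic behind the stalled election in the quoted report can be sketched as below. This is only an illustrative model, not ZooKeeper code: the names `PARTICIPANTS`, `OBSERVERS`, `quorum_size` and `can_elect` are made up for the example. It assumes the default majority quorum and that the only thing that matters is which servers can exchange election messages with each other.

```python
# Hypothetical model of the topology described in the thread; not ZooKeeper code.
PARTICIPANTS = {1, 2, 3, 4, 5}  # voting members
OBSERVERS = {6}                 # observers do not vote in leader election


def quorum_size(participants):
    """Strict majority of the voting members."""
    return len(participants) // 2 + 1


def can_elect(reachable_group, participants=PARTICIPANTS):
    """True if the voters in a mutually reachable group form a majority."""
    voters = set(reachable_group) & set(participants)
    return len(voters) >= quorum_size(participants)


# Server 5 is down, and 1,2 do not receive election messages from 3,4.
print(quorum_size(PARTICIPANTS))  # majority needed out of 5 voters
print(can_elect({1, 2}))          # datacenter A without server 5
print(can_elect({3, 4, 6}))       # datacenter B; 6 is an observer
print(can_elect({1, 2, 3, 4}))    # all remaining voters reach each other
print(can_elect({1, 2, 5}))       # server 5 back, reachable from 1 and 2
```

With 5 down and the two pairs unable to hear each other, neither side reaches the 3 votes needed, which would match the cluster staying in "looking". Once 1,2,3,4 can all exchange messages (the second day), or 5 returns and rejoins at least two other voters, a majority exists again. Whether 5 actually reached only 1 and 2 on its return is an assumption; the logs would have to confirm it.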