Date: Fri, 6 Sep 2019 15:43:09 +0000 (UTC)
From: "Enrico Olivelli (Jira)" <jira@apache.org>
To: issues@zookeeper.apache.org
Message-ID: <JIRA.12690156.1390351910000.15607.1567784589798@Atlassian.JIRA>
In-Reply-To: <JIRA.12690156.1390351910000@Atlassian.JIRA>
References: <JIRA.12690156.1390351910000@Atlassian.JIRA> <JIRA.12690156.1390351910425@jira-he-de>
Subject: [jira] [Updated] (ZOOKEEPER-1865) Fix retry logic in
 Learner.connectToLeader()
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enrico Olivelli updated ZOOKEEPER-1865:
---------------------------------------
    Fix Version/s: 3.5.7

> Fix retry logic in Learner.connectToLeader() 
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-1865
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>            Reporter: Thawan Kooburat
>            Assignee: Edward Carter
>            Priority: Major
>             Fix For: 3.6.0, 3.5.6, 3.5.7
>
>         Attachments: ZOOKEEPER-1865-nanoTime.patch, ZOOKEEPER-1865-testfix.patch, ZOOKEEPER-1865.patch
>
>
> We discovered a long leader election time today in one of our production ensembles.
> Here is the description of the event. 
> Before the old leader went down, it was able to send out its notification messages, so 3 out of 5 servers (including the old leader) elected the old leader as the new leader for the next epoch. While the old leader was being rebooted, the 2 other machines were trying to connect to it, so the quorum couldn't form until those 2 machines gave up and moved on to the next round of leader election.
> This is because Learner.connectToLeader() uses simple retry logic. The contract for this method is that it should never spend longer than initLimit trying to connect to the leader. In our outage, each sock.connect() call probably blocked for initLimit, and it was called 5 times.
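
To illustrate the contract described above, here is a minimal sketch of a retry loop whose total time is bounded by a single deadline rather than by a per-attempt timeout. It is not the content of the attached patches. The class name BoundedConnect, both method names, and all parameters (totalBudgetMs, perAttemptCapMs, maxTries, pauseMs) are hypothetical, and the caller is assumed to have already converted initLimit into milliseconds.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only; not ZooKeeper's Learner code.
public class BoundedConnect {

    /** Milliseconds left before the deadline, clamped to [0, perAttemptCapMs]. */
    static int remainingTimeoutMs(long nowNanos, long deadlineNanos, int perAttemptCapMs) {
        long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadlineNanos - nowNanos);
        if (remainingMs <= 0) {
            return 0;
        }
        return (int) Math.min(remainingMs, (long) perAttemptCapMs);
    }

    /**
     * Tries to connect up to maxTries times, but never spends more than
     * totalBudgetMs overall (assumed to be derived from initLimit by the caller).
     */
    static Socket connectWithDeadline(InetSocketAddress addr, int totalBudgetMs,
                                      int perAttemptCapMs, int maxTries, long pauseMs)
            throws IOException, InterruptedException {
        // One deadline for the whole loop, measured on the monotonic clock.
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(totalBudgetMs);
        IOException last = null;
        for (int attempt = 0; attempt < maxTries; attempt++) {
            int timeout = remainingTimeoutMs(System.nanoTime(), deadline, perAttemptCapMs);
            if (timeout == 0) {
                // Budget exhausted. A timeout of 0 must not be passed to connect(),
                // because Socket.connect treats 0 as "wait forever".
                break;
            }
            Socket sock = new Socket();
            try {
                sock.connect(addr, timeout); // each attempt only gets the time that is left
                return sock;
            } catch (IOException e) {
                last = e;
                try {
                    sock.close();
                } catch (IOException ignored) {
                    // nothing useful to do if closing a failed socket also fails
                }
            }
            // Pause between attempts, but never sleep past the deadline.
            long sleepMs = Math.min(pauseMs,
                    (long) remainingTimeoutMs(System.nanoTime(), deadline, Integer.MAX_VALUE));
            if (sleepMs > 0) {
                Thread.sleep(sleepMs);
            }
        }
        throw new IOException("could not connect to " + addr + " within "
                + totalBudgetMs + " ms", last);
    }
}
```

With a loop of this shape, five attempts that each block until timeout would together take at most totalBudgetMs, instead of five times the per-attempt timeout as in the outage described above.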

