zookeeper-dev mailing list archives

From "Michael K. Edwards (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()
Date Wed, 21 Nov 2018 23:32:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695365#comment-16695365 ]

Michael K. Edwards commented on ZOOKEEPER-1865:

Is this reproducible in current 3.5?

> Fix retry logic in Learner.connectToLeader() 
> ---------------------------------------------
>                 Key: ZOOKEEPER-1865
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>            Reporter: Thawan Kooburat
>            Assignee: Edward Carter
>            Priority: Major
>             Fix For: 3.6.0, 3.5.5
>         Attachments: ZOOKEEPER-1865-nanoTime.patch, ZOOKEEPER-1865-testfix.patch, ZOOKEEPER-1865.patch
> We discovered a long leader election time today in one of our production ensembles.
> Here is a description of the event.
> Before the old leader went down, it was able to send out its notification message, so 3 out
> of 5 servers (including the old leader) elected the old leader as the new leader for the next
> epoch. While the old leader was being rebooted, the 2 other machines kept trying to connect to
> it, so the quorum couldn't form until those 2 machines gave up and moved on to the next round
> of leader election.
> This is because Learner.connectToLeader() uses simple retry logic. The contract for
> this method is that it should never spend longer than initLimit trying to connect to the leader.
> In our outage, each sock.connect() probably blocked for initLimit, and it is called 5 times.
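
The contract described above (never spend longer than initLimit in total, no matter how many attempts are made) can be sketched as a deadline-bounded retry loop. This is only an illustration under stated assumptions, not the attached patch and not ZooKeeper's actual code: the class BoundedConnect, the helpers remainingMillis and connectWithDeadline, and the parameters budgetMillis, maxAttempts and backoffMillis are all hypothetical names, with budgetMillis standing in for the initLimit time budget. The point is that each sock.connect() receives only the time still left before the deadline, instead of a fresh full timeout per attempt.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch: names are illustrative and not taken from ZooKeeper.
final class BoundedConnect {

    /** Milliseconds left until deadlineNanos, clamped to the range [0, Integer.MAX_VALUE]. */
    static int remainingMillis(long deadlineNanos, long nowNanos) {
        // Subtracting first keeps the comparison valid even if nanoTime values wrap around.
        long remaining = (deadlineNanos - nowNanos) / 1_000_000L;
        return (int) Math.max(0L, Math.min(remaining, Integer.MAX_VALUE));
    }

    /**
     * Tries to connect up to maxAttempts times, but never spends more than
     * budgetMillis in total across all attempts and back-off pauses.
     */
    static Socket connectWithDeadline(InetSocketAddress addr, int budgetMillis,
                                      int maxAttempts, long backoffMillis)
            throws IOException, InterruptedException {
        // nanoTime is monotonic, so wall-clock adjustments cannot stretch the budget.
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int timeout = remainingMillis(deadline, System.nanoTime());
            if (timeout <= 0) {
                break; // budget exhausted; a timeout of 0 would mean "wait forever"
            }
            Socket sock = new Socket();
            try {
                sock.connect(addr, timeout); // bounded by the time that is left
                return sock;
            } catch (IOException e) {
                last = e;
                sock.close();
            }
            // Pause between attempts, but never past the deadline.
            long pause = Math.min(backoffMillis, remainingMillis(deadline, System.nanoTime()));
            if (pause > 0) {
                Thread.sleep(pause);
            }
        }
        throw new IOException("could not connect within " + budgetMillis + " ms", last);
    }
}
```

With a loop of this shape, five attempts against an unreachable leader would share one budget rather than each blocking for the full budget in turn.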

This message was sent by Atlassian JIRA
