zookeeper-dev mailing list archives

From "Jonathan Oddy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2930) Leader cannot be elected due to network timeout of some members.
Date Mon, 05 Feb 2018 09:52:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352198#comment-16352198 ]

Jonathan Oddy commented on ZOOKEEPER-2930:

So, I think what happens is this: if the 2nd node in the list dies in a way that causes new
connections to time out, then the notification messages to the 3rd node are delayed by >=5s,
while those to the 1st node are delivered on time. (sendNotifications() queues a notification
to all three nodes, including the local node, in order, and toSend() blocks while sending the
message to the 2nd node.)
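
The ordering effect can be reproduced in isolation. The following is a minimal sketch, not ZooKeeper code: the class HeadOfLineBlockingSketch and the method deliver() are hypothetical names, and a 300 ms sleep stands in for the real connect timeout. It assumes a single sender thread that handles one notification per server id strictly in order.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of one sender thread handling notifications strictly in order. */
class HeadOfLineBlockingSketch {

    /**
     * Handles one notification per server id, in list order, on a single thread.
     * connectDelayMs gives how long the simulated connect to a server id blocks.
     * Returns, per server id, the elapsed milliseconds when its notification was handled.
     */
    static Map<Long, Long> deliver(List<Long> sids, Map<Long, Long> connectDelayMs) {
        Map<Long, Long> handledAtMs = new LinkedHashMap<>();
        long start = System.nanoTime();
        Thread sender = new Thread(() -> {
            for (Long sid : sids) {
                try {
                    // Stand-in for a connect that blocks until it times out.
                    Thread.sleep(connectDelayMs.getOrDefault(sid, 0L));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                handledAtMs.put(sid, (System.nanoTime() - start) / 1_000_000);
            }
        });
        sender.start();
        try {
            sender.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for sender", e);
        }
        return handledAtMs;
    }

    public static void main(String[] args) {
        // Node 2 is unreachable; 300 ms is an assumed stand-in for the connect timeout.
        Map<Long, Long> times =
                deliver(Arrays.asList(1L, 2L, 3L), Collections.singletonMap(2L, 300L));
        System.out.println(times);
    }
}
```

In this model the notification for node 1 is handled almost immediately, while the one for node 3 has to wait for the full simulated timeout on node 2.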

This 5s delay means that if the 3rd node is elected, it will see the election complete >=
5s after the 1st node does. The 1st node attempts to connect to the 3rd on the leader port
5 times with a 1s delay (both hard coded) but, since the 3rd node hasn't seen the election
complete, it hasn't started listening on that port yet. Unless you're very lucky with timing,
the 1st node will give up and start a new election round before the 3rd realises that it has
been elected. The 3rd node then sits there for initLimit before going back to the LOOKING
state, leaving you with a broken cluster for at least initLimit.
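
The timing argument can be written down as a small model. This is only a sketch under stated assumptions: the class and method names are hypothetical, the 5 attempts and 1s delay are the values quoted above, each refused attempt is assumed to fail immediately, and attempts are assumed to be evenly spaced with no sleep after the last one.

```java
/** Hypothetical timing model of the 1st node's connection attempts to the elected node. */
class LeaderConnectWindowSketch {
    static final int ATTEMPTS = 5;            // retry count quoted in the comment
    static final long RETRY_DELAY_MS = 1000;  // delay between attempts quoted in the comment

    /** Time in ms, measured from the first attempt, at which the last attempt is made. */
    static long lastAttemptAtMs() {
        return (ATTEMPTS - 1) * RETRY_DELAY_MS;
    }

    /**
     * True if at least one attempt happens at or after the moment the elected node
     * starts listening, i.e. leaderListenDelayMs after the first attempt.
     */
    static boolean followerConnects(long leaderListenDelayMs) {
        return leaderListenDelayMs <= lastAttemptAtMs();
    }

    public static void main(String[] args) {
        System.out.println("last attempt at ms: " + lastAttemptAtMs());
        System.out.println("connects with 3500 ms delay: " + followerConnects(3500));
        System.out.println("connects with 5000 ms delay: " + followerConnects(5000));
    }
}
```

Under these assumptions the last attempt is made 4s after the first, so a delay of 5s or more on the 3rd node always misses the window.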

My patch attempts to fix this by making the entire process of establishing a connection async,
so that it no longer blocks toSend().
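
To illustrate the general idea (this is not the patch itself), here is a minimal sketch of handing connection establishment to a separate thread pool. AsyncConnectSketch, blockingConnect, connecting and queued() are hypothetical names, and the class is a simplified stand-in, not QuorumCnxManager. It assumes the blocking connect is injected as a callback and that only the newest message per peer needs to be kept.

```java
import java.nio.ByteBuffer;
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.LongConsumer;

/** Hypothetical, simplified connection manager that never connects on the caller's thread. */
class AsyncConnectSketch {
    private static final int SEND_CAPACITY = 1;

    private final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap =
            new ConcurrentHashMap<>();
    // Server ids with a connection attempt currently in flight.
    private final Set<Long> connecting = ConcurrentHashMap.newKeySet();
    private final ExecutorService connectPool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "connect-sketch");
        t.setDaemon(true); // do not keep the JVM alive for pending connects
        return t;
    });
    private final LongConsumer blockingConnect; // assumed: may block until a timeout

    AsyncConnectSketch(LongConsumer blockingConnect) {
        this.blockingConnect = blockingConnect;
    }

    /** Queues the message and returns immediately; connecting happens on the pool. */
    void toSend(long sid, ByteBuffer b) {
        ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.computeIfAbsent(
                sid, k -> new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY));
        while (!bq.offer(b)) {
            bq.poll(); // queue full: drop the oldest message, keep the newest
        }
        // Start at most one connection attempt per server id at a time.
        // A real manager would also skip this when a connection already exists.
        if (connecting.add(sid)) {
            connectPool.execute(() -> {
                try {
                    blockingConnect.accept(sid);
                } finally {
                    connecting.remove(sid);
                }
            });
        }
    }

    /** Number of messages currently queued for the given server id. */
    int queued(long sid) {
        ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
        return bq == null ? 0 : bq.size();
    }

    public static void main(String[] args) {
        AsyncConnectSketch mgr = new AsyncConnectSketch(sid -> {
            if (sid == 2L) { // node 2 is unreachable; 500 ms stands in for the timeout
                try {
                    Thread.sleep(500);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        long start = System.nanoTime();
        for (long sid = 1; sid <= 3; sid++) {
            mgr.toSend(sid, ByteBuffer.allocate(1));
        }
        System.out.println("all toSend() calls returned after ms: "
                + (System.nanoTime() - start) / 1_000_000);
    }
}
```

With this shape, a slow connect to one peer only occupies a pool thread, and the messages for the other peers are queued without waiting for it.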

> Leader cannot be elected due to network timeout of some members.
> ----------------------------------------------------------------
>                 Key: ZOOKEEPER-2930
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2930
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum, server
>    Affects Versions: 3.4.10, 3.5.3, 3.4.11, 3.5.4, 3.4.12
>         Environment: Java 8
> ZooKeeper 3.4.11(from github)
> Centos6.5
>            Reporter: Jiafu Jiang
>            Priority: Critical
>         Attachments: zoo.cfg, zookeeper1.log, zookeeper2.log
> I deploy a cluster of ZooKeeper with three nodes:
> ofs_zk1:,
> ofs_zk2:,
> ofs_zk3:,
> I shutdown the network interfaces of ofs_zk2 using "ifdown eth0 eth1" command.
> I expected a new leader to be elected within a few seconds, but in fact
ofs_zk1 and ofs_zk3 just keep electing again and again, and neither of them can become the new leader.
> I changed the log level to DEBUG (the default is INFO) and restarted the ZooKeeper servers
on ofs_zk1 and ofs_zk2 again, but that did not fix the problem.
> I read the log and the ZooKeeper source code, and I think I have found the reason.
> When the potential leader (say, ofs_zk3) begins the election (FastLeaderElection.lookForLeader()),
it sends notifications to all the servers.
> When it fails to receive any notification within the timeout, it resends the notifications
and doubles the timeout. This process repeats until a notification is received or the
timeout reaches a maximum value.
> FastLeaderElection.sendNotifications() just puts the notification message into a queue
and returns. The WorkerSender is responsible for sending the notifications.
> The WorkerSender processes the notifications one by one, passing each of them
to QuorumCnxManager. Here comes the problem: QuorumCnxManager.toSend() blocks for a long
time when a notification is sent to ofs_zk2 (whose network is down), so some notifications
(those destined for ofs_zk1) are also blocked for a long time. The repeated notifications
from FastLeaderElection.sendNotifications() just make things worse.
> Here is the related source code:
> {code:java}
>     public void toSend(Long sid, ByteBuffer b) {
>         /*
>          * If sending message to myself, then simply enqueue it (loopback).
>          */
>         if (this.mySid == sid) {
>              b.position(0);
>              addToRecvQueue(new Message(b.duplicate(), sid));
>             /*
>              * Otherwise send to the corresponding thread to send.
>              */
>         } else {
>              /*
>               * Start a new connection if doesn't have one already.
>               */
>              ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY);
>              ArrayBlockingQueue<ByteBuffer> bqExisting = queueSendMap.putIfAbsent(sid, bq);
>              if (bqExisting != null) {
>                  addToSendQueue(bqExisting, b);
>              } else {
>                  addToSendQueue(bq, b);
>              }
>              // This may block!!!
>              connectOne(sid);
>         }
>     }
> {code}
> Therefore, when ofs_zk3 believes that it is the leader, it begins to wait for the epoch ack,
but in fact ofs_zk1 has not received the notification (which says the leader is ofs_zk3)
because ofs_zk3 has not sent it yet (it may still be in the send queue of
WorkerSender). In the end, the potential leader ofs_zk3 fails to receive the epoch ack before the timeout,
so it gives up leadership and begins a new election.
> The log files of ofs_zk1 and ofs_zk3 are attached.
