zookeeper-dev mailing list archives

From "Ben Sherman (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ZOOKEEPER-2783) follower disconnects and cannot reconnect
Date Sat, 13 May 2017 00:44:04 GMT
Ben Sherman created ZOOKEEPER-2783:

             Summary: follower disconnects and cannot reconnect
                 Key: ZOOKEEPER-2783
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2783
             Project: ZooKeeper
          Issue Type: Bug
          Components: leaderElection
    Affects Versions: 3.4.10
         Environment: centos 7, AWS EC2
            Reporter: Ben Sherman

We have a 5-node cluster running 3.4.10 (we saw this in 3.4.8 and 3.4.9 as well). Occasionally
a node gets a read timeout, drops all of its connections, and tries to re-establish itself with
the quorum.  It can usually do this in a few seconds, but last night it took almost 15 minutes
to reconnect.

These are 5 servers in AWS. We've tried tuning the timeouts, but reconnection still fails
even with timeouts longer than any reasonable value.
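
For readers less familiar with how ZooKeeper timeouts are expressed: the quorum limits in zoo.cfg are given in ticks, so the wall-clock value is tickTime multiplied by initLimit or syncLimit. The sketch below does that conversion. The report does not say which settings were tuned or to what values, so the numbers here are hypothetical and the helper name limits_in_ms is made up for illustration.

```python
def limits_in_ms(cfg_text):
    """Parse zoo.cfg-style key=value lines and convert tick-based limits to ms."""
    settings = {}
    for line in cfg_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not a key=value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    tick = int(settings["tickTime"])
    return {
        "initLimit_ms": tick * int(settings["initLimit"]),
        "syncLimit_ms": tick * int(settings["syncLimit"]),
    }


# Hypothetical values, NOT the reporter's actual configuration.
EXAMPLE_CFG = """
tickTime=2000
initLimit=10
syncLimit=5
"""

print(limits_in_ms(EXAMPLE_CFG))
```

With values like these the limits come out in the tens of seconds, which gives a sense of scale for a reconnect that takes many minutes.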

In the attached logs, server 5 is a follower and server 3 is the leader.  5 loses connectivity
at 11:21:34, and 3 sees the disconnect at the same moment.

5 tries to rejoin the quorum, but cannot do so until the connections to the other servers
expire at 11:37:02.  After the connections are re-established, 5 connects immediately.
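
As a quick check on the timeline, the gap between the timestamps quoted in this report can be computed directly. This is only a sketch: it assumes all times are on the same day and from comparable clocks, and the helper name elapsed is made up for illustration.

```python
from datetime import datetime


def elapsed(start, end, fmt="%H:%M:%S"):
    """Return the timedelta between two same-day clock times."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Timestamps taken from the report: disconnect, then connection expiry.
outage = elapsed("11:21:34", "11:37:02")
print(outage)                  # 0:15:28
print(outage.total_seconds())  # 928.0
```

That is roughly 15 and a half minutes between the disconnect and the point where the old connections expired.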

At 11:41:08, the operator restarted the server, and it reconnected normally.

I suspect there is a problem with stale connections to the rest of the quorum. The other
services on this box (monitoring, puppet) were fine and able to establish new connections
with no problems.
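
One way to separate a network-level problem from stale connections inside the ZooKeeper process would be to open fresh TCP connections to the peers' quorum and election ports while the follower is stuck. The sketch below is an assumption-laden illustration, not something taken from the attached logs: ports 2888 and 3888 are the ones used in the ZooKeeper documentation's example configuration and may differ in a given zoo.cfg, and the function names and host names are hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a fresh TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe_peers(peers, ports=(2888, 3888), timeout=3.0):
    """Map each (host, port) pair to whether a new connection could be opened."""
    return {(h, p): can_connect(h, p, timeout) for h in peers for p in ports}


# Example call (hypothetical host names):
# probe_peers(["zk1.example.internal", "zk3.example.internal"])
```

If new connections succeed while the follower still cannot rejoin, that would be consistent with the stale-connection theory, although it would not prove it.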

I posed this problem to the zookeeper-users list and was asked to open a ticket.
