zookeeper-dev mailing list archives

From "Igor Skokov (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ZOOKEEPER-3320) Don't give up on bind of leader election port
Date Sun, 17 Mar 2019 08:22:00 GMT
Igor Skokov created ZOOKEEPER-3320:

             Summary: Don't give up on bind of leader election port
                 Key: ZOOKEEPER-3320
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3320
             Project: ZooKeeper
          Issue Type: Bug
          Components: leaderElection
    Affects Versions: 3.5.4, 3.4.10
            Reporter: Igor Skokov

When running a ZooKeeper 3.5.4 cluster on Kubernetes, I found that under some circumstances
a ZooKeeper node stops listening on the leader election port. This makes the ZK cluster unavailable.

ZooKeeper is deployed as a Kubernetes StatefulSet and has the following dynamic configuration:


The bind address contains the DNS name that Kubernetes generates for each StatefulSet pod.
These DNS names become resolvable only some time after the container starts. That delay
causes the leader election port listener in the QuorumCnxManager.Listener class to stop.
The error occurs in the "run" method of QuorumCnxManager.Listener, which tries to bind the
leader election port to a hostname that is not yet resolvable. The retry count is hard-coded
to 3 (with a backoff of 1 sec).
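
The failure mode described above can be reproduced outside ZooKeeper with a small sketch.
This is not the QuorumCnxManager.Listener code; the class name HardCodedBindRetry, the
method listen, and the Supplier parameter are assumptions made for illustration. It relies
only on the standard behavior that ServerSocket.bind rejects an unresolved
InetSocketAddress with a SocketException, as seen in the log.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.util.function.Supplier;

// Illustrative only: a stand-in for a listener that gives up after a fixed number of bind failures.
final class HardCodedBindRetry {
    // The report states that the limit is hard-coded to 3.
    static final int MAX_RETRIES = 3;

    /** Returns true if the socket was bound, false if the listener gave up. */
    static boolean listen(Supplier<InetSocketAddress> address, long backoffMillis) {
        int failures = 0;
        while (failures < MAX_RETRIES) {
            try (ServerSocket socket = new ServerSocket()) {
                // Ask for the address on every attempt; bind() throws a
                // SocketException if the address is still unresolved.
                socket.bind(address.get());
                return true; // a real listener would go on to accept() connections here
            } catch (IOException e) {
                failures++;
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false; // corresponds to "Leaving listener": no further bind attempts
    }
}
```

With a backoff of 1000 ms and a hostname that needs longer than about three seconds to
become resolvable, listen returns false and nothing ever binds the port again.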

The ZooKeeper server log contains the following errors:

2019-03-17 07:56:04,844 [myid:1] - WARN  [QuorumPeer[myid=1](plain=/]
- Unexpected exception
java.net.SocketException: Unresolved address
	at java.base/java.net.ServerSocket.bind(ServerSocket.java:374)
	at java.base/java.net.ServerSocket.bind(ServerSocket.java:335)
	at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:241)
	at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1023)
	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1226)
2019-03-17 07:56:04,844 [myid:1] - WARN  [QuorumPeer[myid=1](plain=/]
- PeerState set to LOOKING
2019-03-17 07:56:04,845 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/]
2019-03-17 07:56:04,845 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/]
- New election. My id =  1, proposed zxid=0x0
2019-03-17 07:56:04,846 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@687] -
Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0xf (n.round), LOOKING
(n.state), 1 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2019-03-17 07:56:04,979 [myid:1] - INFO  [zookeeper-0.zookeeper:2183:QuorumCnxManager$Listener@892]
- Leaving listener
2019-03-17 07:56:04,979 [myid:1] - ERROR [zookeeper-0.zookeeper:2183:QuorumCnxManager$Listener@894]
- As I'm leaving the listener thread, I won't be able to participate in leader election any
longer: zookeeper-0.zookeeper:2183

This error occurs on most nodes at cluster start, so ZooKeeper is unable to form a quorum,
which leaves the cluster in an unusable state.
As far as I can see, the error is present on branches 3.4 and 3.5.
I think it can be fixed by making the number of retries configurable (instead of the
hard-coded value of 3).
Another way to fix it is to remove the retry limit altogether. Currently, the ZK server only
stops the leader election listener and continues to serve on its other ports. Maybe, if
leader election halts, we should abort the process.
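
A minimal sketch of the two proposals, assuming a retry limit read from a system property
where 0 means "retry forever", and an exception that the caller can use to abort the process
when the limit is exhausted. The property name example.electionPortBindRetry, the class
ConfigurableBindRetry, and its methods are hypothetical and exist only for this sketch; this
is not a patch against ZooKeeper.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.util.function.Supplier;

// Illustrative only: configurable retry limit, with 0 meaning unlimited retries.
final class ConfigurableBindRetry {
    // Hypothetical property name, chosen for this sketch.
    static final String RETRY_PROPERTY = "example.electionPortBindRetry";

    /** Reads the retry limit from the system property; falls back to 3 when it is unset. */
    static int maxRetries() {
        return Integer.getInteger(RETRY_PROPERTY, 3);
    }

    /**
     * Tries to bind until it succeeds or the limit is reached (0 = no limit).
     * Returns the number of failed attempts that preceded the successful bind.
     * Throws IllegalStateException when the limit is exhausted, so the caller
     * can abort the process instead of running without a listener.
     */
    static int bindOrAbort(Supplier<InetSocketAddress> address, int maxRetries, long backoffMillis) {
        int failures = 0;
        while (maxRetries == 0 || failures < maxRetries) {
            // The socket is closed right away here; a real listener would keep it open.
            try (ServerSocket socket = new ServerSocket()) {
                socket.bind(address.get());
                return failures;
            } catch (IOException e) {
                failures++;
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to bind", ie);
                }
            }
        }
        throw new IllegalStateException("could not bind after " + failures + " attempts");
    }
}
```

A caller could invoke bindOrAbort(addressSupplier, ConfigurableBindRetry.maxRetries(), 1000)
and exit the JVM when the exception is thrown. Retrying forever suits the slow-DNS case but
would also loop on errors that never recover, such as a port that is already in use, so a
finite default may still be the safer choice.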

