zookeeper-dev mailing list archives

From "Brian Nixon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-3320) Leader election port stop listen when hostname unresolvable for some time
Date Tue, 19 Mar 2019 17:50:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16796341#comment-16796341 ]

Brian Nixon commented on ZOOKEEPER-3320:

This is an interesting error case!

I would expect an issue in QuorumCnxManager to bring the peer down if it cannot create the
socket but it seems this only occurs with a BindException and not a generic SocketException.
At the least, I think we ought to fix that.

Looking at this from the opposite direction, can you add the desired delay in the startup
sequence of your Kubernetes container? My concern is that the pattern of "DNS is currently
unreliable but will be reliable soon" seems specific to the container management and may result
in strange behavior when applied to other environments.
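
One way to add such a delay is a small pre-start check that blocks until the pod's own hostname resolves, and only then launches the server. The following is a minimal sketch under that assumption, not ZooKeeper code: the class name DnsWait, the method waitForResolvable and the timeout values are all hypothetical.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical pre-start helper; not part of ZooKeeper.
final class DnsWait {

    /**
     * Polls until the given host name resolves or the timeout elapses.
     * Returns true if the name resolved, false on timeout or interruption.
     */
    static boolean waitForResolvable(String host, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            try {
                InetAddress.getByName(host); // throws if the name cannot be resolved
                return true;
            } catch (UnknownHostException e) {
                // Not resolvable yet; fall through and retry after a pause.
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Host name is taken from the command line; "localhost" is only a placeholder default.
        String host = args.length > 0 ? args[0] : "localhost";
        boolean ok = waitForResolvable(host, 60_000L, 1_000L);
        System.out.println(host + (ok ? " resolved" : " did not resolve in time"));
        if (!ok) {
            System.exit(1); // let the container runtime restart the pod
        }
    }
}
```

Note that the JVM caches failed lookups for a short time (the networkaddress.cache.negative.ttl security property), so the polling interval may not translate one-to-one into fresh DNS queries.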

> Leader election port stop listen when hostname unresolvable for some time 
> --------------------------------------------------------------------------
>                 Key: ZOOKEEPER-3320
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3320
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.10, 3.5.4
>            Reporter: Igor Skokov
>            Priority: Major
> When trying to run a ZooKeeper 3.5.4 cluster on Kubernetes, I found that in some circumstances
> a ZooKeeper node stops listening on the leader election port. This makes the ZK cluster unavailable.

> ZooKeeper is deployed as a Kubernetes StatefulSet and has the following dynamic configuration:
> {code:java}
> zookeeper-0.zookeeper:2182:2183:participant;2181
> zookeeper-1.zookeeper:2182:2183:participant;2181
> zookeeper-2.zookeeper:2182:2183:participant;2181
> {code}
> The bind address contains a DNS name which Kubernetes generates for each StatefulSet pod.
> These DNS names become resolvable after the container starts, but with some delay. That
> delay causes the leader election port listener in the QuorumCnxManager.Listener class to stop.
> The error happens in the QuorumCnxManager.Listener "run" method, which tries to bind the leader
> election port to a hostname that is not resolvable at that moment. The retry count is hard-coded
> to 3 (with a backoff of 1 sec).
> The ZooKeeper server log contains the following errors:
> {code:java}
> 2019-03-17 07:56:04,844 [myid:1] - WARN  [QuorumPeer[myid=1](plain=/] - Unexpected exception
> java.net.SocketException: Unresolved address
> 	at java.base/java.net.ServerSocket.bind(ServerSocket.java:374)
> 	at java.base/java.net.ServerSocket.bind(ServerSocket.java:335)
> 	at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:241)
> 	at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1023)
> 	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1226)
> 2019-03-17 07:56:04,844 [myid:1] - WARN  [QuorumPeer[myid=1](plain=/] - PeerState set to LOOKING
> 2019-03-17 07:56:04,845 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/]
> 2019-03-17 07:56:04,845 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/] - New election. My id =  1, proposed zxid=0x0
> 2019-03-17 07:56:04,846 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@687] - Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0xf (n.round), LOOKING (n.state), 1 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2019-03-17 07:56:04,979 [myid:1] - INFO  [zookeeper-0.zookeeper:2183:QuorumCnxManager$Listener@892] - Leaving listener
> 2019-03-17 07:56:04,979 [myid:1] - ERROR [zookeeper-0.zookeeper:2183:QuorumCnxManager$Listener@894] - As I'm leaving the listener thread, I won't be able to participate in leader election any longer: zookeeper-0.zookeeper:2183
> {code}
> This error happens on most nodes on cluster start, and ZooKeeper is unable to form a quorum.
> This leaves the cluster in an unusable state.
> As far as I can see, the error is present on branches 3.4 and 3.5.
> I think this error can be fixed by making the number of retries configurable (instead of the
> hard-coded value of 3).
> Another way to fix this is to remove the retry limit entirely. Currently, the ZK server only stops
> the leader election listener and continues to serve on other ports. Maybe, if leader election halts,
> we should abort the process.
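
The configurable-retry idea from the description can be illustrated with a short sketch. This is not the actual QuorumCnxManager.Listener code; the class BindRetrySketch, the method bindWithRetry and its parameters are hypothetical. It assumes the address must be re-resolved on every attempt, since an InetSocketAddress built while the name is unresolvable stays unresolved and binding to it fails with the "Unresolved address" SocketException seen in the log.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical illustration; not the ZooKeeper implementation.
final class BindRetrySketch {

    /**
     * Tries to bind a server socket to host:port up to maxRetries times,
     * sleeping backoffMillis between attempts. Throws if every attempt fails,
     * so the caller can decide to abort the process.
     */
    static ServerSocket bindWithRetry(String host, int port, int maxRetries, long backoffMillis) {
        if (maxRetries < 1) {
            throw new IllegalArgumentException("maxRetries must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            ServerSocket socket = null;
            try {
                socket = new ServerSocket();
                socket.setReuseAddress(true);
                // Build the address inside the loop so the host name is resolved again.
                socket.bind(new InetSocketAddress(host, port));
                return socket;
            } catch (IOException e) {
                last = e;
                if (socket != null) {
                    try {
                        socket.close();
                    } catch (IOException ignored) {
                        // Nothing useful to do if closing the unbound socket fails.
                    }
                }
            }
            if (attempt < maxRetries) {
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry bind", ie);
                }
            }
        }
        throw new UncheckedIOException("could not bind " + host + ":" + port
                + " after " + maxRetries + " attempts", last);
    }
}
```

A caller could read maxRetries from a system property or configuration entry of its choosing and treat the final exception as fatal, which would correspond to the "abort the process" option above.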

This message was sent by Atlassian JIRA
