zookeeper-issues mailing list archives

From Michael Dürr (Jira) <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing
Date Mon, 30 Sep 2019 10:55:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16940855#comment-16940855 ]

Michael Dürr commented on ZOOKEEPER-2164:
-----------------------------------------

Same problem here:
 * Simple 3-node cluster (zoo1, zoo2, zoo3); zoo1 is the leader.
 * After shutting down and restarting zoo1, the cluster does not return to a proper state: zoo3 becomes leader and zoo2 follows, but zoo1 does not resume serving requests:

{code:bash}
$ echo "stat" | nc zoo1 2181
This ZooKeeper instance is not currently serving requests
{code}

{code:bash}
$ echo "stat" | nc zoo2 2181
Zookeeper version: 3.5.5-390fe37ea45dee01bf87dc1c042b5e3dcce88653, built on 05/03/2019 12:07 GMT
Clients:
 /172.22.0.1:45764[0](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/0/0
Received: 3
Sent: 2
Connections: 1
Outstanding: 0
Zxid: 0x1000000009
Mode: follower
Node count: 401
{code}

{code:bash}
$ echo "stat" | nc zoo3 2181
Zookeeper version: 3.5.5-390fe37ea45dee01bf87dc1c042b5e3dcce88653, built on 05/03/2019 12:07 GMT
Clients:
 /172.22.0.1:53132[0](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/0/0
Received: 2
Sent: 1
Connections: 1
Outstanding: 0
Zxid: 0x1400000000
Mode: leader
Node count: 401
Proposal sizes last/min/max: -1/-1/-1
{code}
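
The three manual checks above could be automated along these lines. This is only a minimal sketch: the helper names (`parse_mode`, `fetch_stat`, `unhealthy`) are made up for illustration, and it assumes the `stat` four-letter command is enabled on each server, as it is in the output shown.

```python
import socket

# Text returned by a server that is up but has not joined the quorum
# (copied from the zoo1 output above).
NOT_SERVING = "This ZooKeeper instance is not currently serving requests"


def parse_mode(stat_output: str) -> str:
    """Extract the server role from `stat` output.

    Returns the value of the "Mode:" line (e.g. "leader", "follower"),
    "not-serving" for the not-serving message, or "unknown" otherwise.
    """
    if NOT_SERVING in stat_output:
        return "not-serving"
    for line in stat_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return "unknown"


def fetch_stat(host: str, port: int = 2181, timeout: float = 3.0) -> str:
    """Rough equivalent of `echo "stat" | nc host port` (hypothetical helper)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stat")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def unhealthy(modes: dict) -> list:
    """Return the hosts that are neither leader nor follower."""
    return sorted(h for h, m in modes.items() if m not in ("leader", "follower"))
```

For the state shown above, `unhealthy({"zoo1": "not-serving", "zoo2": "follower", "zoo3": "leader"})` returns `["zoo1"]`. Against a live cluster, the `modes` dictionary would be built with `{h: parse_mode(fetch_stat(h)) for h in ("zoo1", "zoo2", "zoo3")}`.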

> fast leader election keeps failing
> ----------------------------------
>
>                 Key: ZOOKEEPER-2164
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.5
>            Reporter: Michi Mutsuzaki
>            Priority: Major
>             Fix For: 3.6.0, 3.5.7
>
>
> I have a 3-node cluster with sids 1, 2 and 3. Originally 2 is the leader. When I shut down 2, 1 and 3 keep going back to leader election. Here is what seems to be happening.
> - Both 1 and 3 elect 3 as the leader.
> - 1 receives votes from 3 and itself, and starts trying to connect to 3 as a follower.
> - 3 doesn't receive votes for 5 seconds because connectOne() to 2 doesn't time out for 5 seconds: https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L346
> - By the time 3 receives votes, 1 has given up trying to connect to 3: https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/Learner.java#L247
> I'm using 3.4.5, but it looks like this part of the code hasn't changed for a while, so I'm guessing later versions have the same issue.
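
The timing race in the quoted report can be illustrated with a little arithmetic. The sketch below is a toy model, not ZooKeeper code: the 5-second delay is the figure from the report, while the follower's retry budget (number of attempts and the interval between them) is an assumed parameter chosen for illustration and should be checked against Learner.java. It also assumes each failed attempt is refused immediately.

```python
from typing import Optional


def first_successful_attempt(leader_ready_at_s: float,
                             attempts: int,
                             retry_interval_s: float) -> Optional[float]:
    """Toy model of a follower retrying its connection to the elected leader.

    The follower tries at t = 0, retry_interval_s, 2 * retry_interval_s, ...
    for `attempts` tries. The leader only starts accepting followers at
    `leader_ready_at_s` (in the report, after the blocking connect to the
    dead server times out). Returns the time of the first attempt that
    succeeds, or None if the follower gives up first and goes back to
    leader election.
    """
    for i in range(attempts):
        t = i * retry_interval_s
        if t >= leader_ready_at_s:
            return t
    return None


# Leader blocked for 5 s (from the report); follower budget of 5 attempts,
# 1 s apart, is an assumption made for this example.
print(first_successful_attempt(5.0, attempts=5, retry_interval_s=1.0))

# With a shorter block on the leader side, the same follower gets through.
print(first_successful_attempt(1.0, attempts=5, retry_interval_s=1.0))
```

In this model the follower only joins if its retry window outlasts the time the leader spends blocked, which suggests two directions for a fix: shorten the blocking connect on the leader side, or lengthen the follower's retry window.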



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
