zookeeper-dev mailing list archives

From "Yuval Dori (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant
Date Wed, 07 Mar 2018 09:30:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389307#comment-16389307 ]

Yuval Dori commented on ZOOKEEPER-2172:



Several of our customers hit this issue on version 3.4.8.

We are currently upgrading them to 3.4.10.

Since 3.5.3 is still in beta, is it possible to backport this fix?





> Cluster crashes when reconfig a new node as a participant
> ---------------------------------------------------------
>                 Key: ZOOKEEPER-2172
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum, server
>    Affects Versions: 3.5.0
>         Environment: Ubuntu 12.04 + java 7
>            Reporter: Ziyou Wang
>            Assignee: Mohammad Arshad
>            Priority: Critical
>             Fix For: 3.5.3, 3.6.0
>         Attachments: ZOOKEEPER-2172-02.patch, ZOOKEEPER-2172-03.patch, ZOOKEEPER-2172-04.patch,
ZOOKEEPER-2172-06.patch, ZOOKEEPER-2172-07.patch, ZOOKEEPER-2172.patch, ZOOKEPER-2172-05.patch,
history.txt, node-1.log, node-2.log, node-3.log, zoo-1.log, zoo-2-1.log, zoo-2-2.log, zoo-2-3.log,
zoo-2.log, zoo-2212-1.log, zoo-2212-2.log, zoo-2212-3.log, zoo-3-1.log, zoo-3-2.log, zoo-3-3.log,
zoo-3.log, zoo-4-1.log, zoo-4-2.log, zoo-4-3.log, zoo.cfg.dynamic.10000005d, zoo.cfg.dynamic.next,
zookeeper-1.log, zookeeper-1.out, zookeeper-2.log, zookeeper-2.out, zookeeper-3.log, zookeeper-3.out
> The operations are quite simple: start three zk servers one by one, then reconfig the
cluster to add each new one as a participant. When I add the third one, the zk cluster may
enter a weird state and cannot recover.
>       I found “2015-04-20 12:53:48,236 [myid:1] - INFO  [ProcessThread(sid:1 cport:-1)::PrepRequestProcessor@547]
- Incremental reconfig” in the node-1 log. So the first node received the reconfig cmd at 12:53:48.
Later, it logged “2015-04-20  12:53:52,230 [myid:1] - ERROR [LearnerHandler-/]
- Unexpected exception causing shutdown while sock still open” and “2015-04-20 12:53:52,231
[myid:1] - WARN  [LearnerHandler-/] - ******* GOODBYE  /
********”. From then on, the first and second nodes rejected all client connections
and the third node didn’t join the cluster as a participant. The whole cluster was down.
>      When the problem happened, all three nodes used the same dynamic config file
zoo.cfg.dynamic.10000005d, which only contained the first two nodes. But there was another,
unused dynamic config file in node-1's directory, zoo.cfg.dynamic.next, which already contained
three nodes.
>      When I extended the waiting time between starting the third node and reconfiguring
the cluster, the problem didn't show up again. So it appears to be a race condition.
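
For reference, the inconsistent state described above involves two dynamic configuration files of roughly the following shape. This is a sketch only: the server addresses and ports below are illustrative assumptions, not values taken from the attached logs; the file names and member counts are from the report.

```
# zoo.cfg.dynamic.10000005d -- committed config shared by all three nodes,
# listing only the first two participants (addresses are placeholders)
server.1=host1:2888:3888:participant;2181
server.2=host2:2888:3888:participant;2181

# zoo.cfg.dynamic.next -- proposed next config left behind in node-1's
# directory, already listing all three participants
server.1=host1:2888:3888:participant;2181
server.2=host2:2888:3888:participant;2181
server.3=host3:2888:3888:participant;2181
```

In 3.5.x the reconfig that triggers this would be issued from zkCli.sh as something like `reconfig -add server.3=host3:2888:3888:participant;2181` (again with placeholder addresses). The leftover .next file is consistent with the reporter's observation that the failure depends on timing between starting the third server and issuing the reconfig.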

This message was sent by Atlassian JIRA
