Hi, Anand, I took a look and I wonder how do you explain the following The N1.log starts at around 2016-10-03 12:37:38,073 [myid:1] - INFO [QuorumPeer[myid=1]/127.0.0.1:5002:QuorumPeer@714] - LOOKING and it failed to talk to Node 2 (id: 2) 2016-10-03 12:37:38,136 [myid:1] - WARN [WorkerSender[myid=1]:QuorumCnxManager@382] - Cannot open channel to 2 at election address node2.controller.local/127.0.0.8:5004 java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) However, after failed to get any election messages from Node 3, it started a new round but then I seem to see the election messages from Node 2 again. Any idea why? 2016-10-03 12:38:32,815 [myid:1] - INFO [QuorumPeer[myid=1]/127.0.0.1:5002:QuorumPeer@714] - LOOKING 2016-10-03 12:38:32,816 [myid:1] - INFO [QuorumPeer[myid=1]/127.0.0.1:5002:FileSnap@83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.20002aa0b 2016-10-03 12:38:32,875 [myid:1] - INFO [QuorumPeer[myid=1]/127.0.0.1:5002:FastLeaderElection@815] - New election. My id = 1, proposed zxid=0x20002bfb5 2016-10-03 12:38:32,876 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 1 (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LOOKING (n.state), 1 (n.sid), 0x2 (n.peerEpoch) LOOKING (my state) 2016-10-03 12:38:32,878 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 2 (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LOOKING (n.state), 2 (n.sid), 0x2 (n.peerEpoch) LOOKING (my state) 2016-10-03 12:38:32,878 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 2 (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LOOKING (n.state), 1 (n.sid), 0x2 (n.peerEpoch) LOOKING (my state) 2016-10-03 12:38:32,917 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 3 (n.leader), 0x20002bfb5 (n.zxid), 0x4 (n.round), LEADING (n.state), 3 (n.sid), 0x2 (n.peerEpoch) LOOKING (my state) 2016-10-03 12:38:33,117 [myid:1] - INFO [QuorumPeer[myid=1]/127.0.0.1:5002:FastLeaderElection@849] - Notification time out: 400 2016-10-03 12:38:33,118 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 2 (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LOOKING (n.state), 1 (n.sid), 0x2 (n.peerEpoch) LOOKING (my state) 2016-10-03 12:38:33,159 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 3 (n.leader), 0x20002bfb5 (n.zxid), 0x4 (n.round), LEADING (n.state), 3 (n.sid), 0x2 (n.peerEpoch) LOOKING (my state) 2016-10-03 12:38:33,559 [myid:1] - INFO [QuorumPeer[myid=1]/127.0.0.1:5002:FastLeaderElection@849] - Notification time out: 800 2016-10-03 12:38:33,560 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 2 (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LOOKING (n.state), 1 (n.sid), 0x2 (n.peerEpoch) LOOKING (my state) 2016-10-03 12:38:33,561 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 2 (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LEADING (n.state), 2 (n.sid), 0x2 (n.peerEpoch) LOOKING (my state) 2016-10-03 12:38:33,601 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 3 (n.leader), 0x20002bfb5 (n.zxid), 0x4 (n.round), LEADING (n.state), 3 (n.sid), 0x2 (n.peerEpoch) LOOKING (my state) 2016-10-03 12:38:34,402 [myid:1] - INFO [QuorumPeer[myid=1]/127.0.0.1:5002:FastLeaderElection@849] - Notification time out: 1600 On Oct 8, 2016, at 10:05 AM, Anand Parthasarathy > wrote: Hi Flavio, I have attached the logs from node 1 and node 3. Node 2 was powered off around 10-03 12:36. Leader election kept going until 10-03 15:57:16 when it finally converged. Thanks, Anand. On Sat, Oct 8, 2016 at 7:55 AM, Flavio Junqueira > wrote: Hi Anand, I don't understand whether 1 and 3 were able or even trying to connect to each other. They should be able to elect a leader between them and make progress. You might want to upload logs and let us know. -Flavio > On 08 Oct 2016, at 02:11, Anand Parthasarathy > wrote: > > Hi, > > We are currently using zookeeper 3.4.6 version and use a 3 node solution in > our system. We see that occasionally, when a node is powered off (in this > instance, it was actually a leader node), the remaining two nodes do not > form a quorum for a really long time. Looking at the logs, it appears the > sequence is as follows: > - Node 2 is the zookeeper leader > - Node 2 is powered off > - Node 1 and Node 3 recognize and start the election > - Node 3 times out after initLimit * tickTime with "Timeout while waiting > for quorum" for Round N > - Node 1 times out after initLimit * tickTime with "Exception while trying > to follow leader" for Round N+1 at the same time. > - And the process continues where N is sequentially incrementing. > - This happens for a long time. > - In one instance, we used tickTime=5000 and initLimit=20 and it took > around 3.5 hours to converge. > - In a given round, Node 1 will try connecting to Node 2, gets connection > refused waits for notification timeout which increases by 2 every iteration > until it hits the initLimit. Connection Refused is because the node 2 comes > up after reboot, but zookeeper process is not started (due to a different > failure). > > It looks similar to ZOOKEEPER-2164 but there it is a connection timeout > where Node 2 is not reachable. > > Could you pls. share if you have seen this issue and if so, what is the > workaround that can be employed in 3.4.6. > > Thanks, > Anand.