Return-Path: X-Original-To: apmail-zookeeper-user-archive@www.apache.org Delivered-To: apmail-zookeeper-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D0A61184D0 for ; Mon, 25 Jan 2016 19:30:19 +0000 (UTC) Received: (qmail 61890 invoked by uid 500); 25 Jan 2016 19:30:19 -0000 Delivered-To: apmail-zookeeper-user-archive@zookeeper.apache.org Received: (qmail 61842 invoked by uid 500); 25 Jan 2016 19:30:19 -0000 Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@zookeeper.apache.org Delivered-To: mailing list user@zookeeper.apache.org Received: (qmail 61830 invoked by uid 99); 25 Jan 2016 19:30:18 -0000 Received: from Unknown (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 25 Jan 2016 19:30:18 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 5A27A180530 for ; Mon, 25 Jan 2016 19:30:18 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 3.111 X-Spam-Level: *** X-Spam-Status: No, score=3.111 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, HTML_MESSAGE=3, T_DKIM_INVALID=0.01, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd3-us-west.apache.org (amavisd-new); dkim=fail (1024-bit key) reason="fail (bad RSA signature)" header.d=airtime.com Received: from mx1-us-east.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id pIhg7hb1N047 for ; Mon, 25 Jan 2016 19:30:04 +0000 (UTC) Received: from mail-ob0-f182.google.com (mail-ob0-f182.google.com [209.85.214.182]) by mx1-us-east.apache.org (ASF Mail Server at mx1-us-east.apache.org) with ESMTPS id 7370543A1A for ; Mon, 25 Jan 2016 19:30:04 +0000 (UTC) Received: by mail-ob0-f182.google.com with SMTP id ba1so126159195obb.3 for ; Mon, 25 Jan 2016 11:30:04 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=airtime.com; s=airtime; h=mime-version:date:message-id:subject:from:to:content-type; bh=RrVwOqY8ING9ydCJuWf7S0IEFyybzjDcu2IfqClLpao=; b=W0wXbqdzGjrP8xptJrWA+9ox4H4op4hsKyswILqhASfBZJAWUFLIOQ1soiugBjpith wgU37rvrdcjWCHgfBN3o3wzsh9tMlu41XV5yLU8/iOP0L1BiV7kxW/iHLtQUJ9fz8f6p h1t+bxK4rJOIBxgdSvypeThZmPVSdqg2iUPQE= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:date:message-id:subject:from:to :content-type; bh=RrVwOqY8ING9ydCJuWf7S0IEFyybzjDcu2IfqClLpao=; b=EkZgV2HryPc62A/SkwoBGomJTxIdDc+LQjOQdcb5EpDuDvBBZJQeZHJa1/+yrbk2tT JU3tQTIO8PRlXud11MLNRb+istqRsKMn9mf8qJPEVylL4K8JLjCuwjiBmEgDyqHLJmTv FJGoujjJOEt9wkObFHtP1snwz8Pi50KOGHzIH6v7fLuM0sLKLfxZr/sSFa84iwKolhHo jvbHv90wrTdtQnY1+U0wJZ5v2XSzmTR645rQkdMy+qiSpob8330RM5PeE+0972rEnECf A+KarSwpDF4CcWirmd656Z+ubrAfocAxVfq45+Pa5LsxCCACw7UDmqGQGpCQqWxuzT7v foHA== X-Gm-Message-State: AG10YOTI7TsN806Tri3EyBqRYFuCujzhUWmZbk8Wie9DCo/hQ31GvVivWAZnO6BtRCogp0Dp0D7pHF38jhaagA== MIME-Version: 1.0 X-Received: by 10.60.116.169 with SMTP id jx9mr14610493oeb.30.1453750203840; Mon, 25 Jan 2016 11:30:03 -0800 (PST) Received: by 10.202.87.2 with HTTP; Mon, 25 Jan 2016 11:30:03 -0800 (PST) Date: Mon, 25 Jan 2016 11:30:03 -0800 Message-ID: Subject: Zookeeper leader is terminated in AWS and cluster never recovers From: Ken Mamitsuka To: user@zookeeper.apache.org Content-Type: multipart/alternative; boundary=089e0115f28e7e191d052a2d9887 --089e0115f28e7e191d052a2d9887 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable This may be related to the post "All nodes shutting down simultaneously", but I'm not sure. I was able to reproduce this quite a few times. I terminated the leader of a 5 node zookeeper cluster (terminated from the AWS console). All the remaining hosts saw the leader disappear: Jan 20 18:40:05 localhost zookeeper: WARN [QuorumPeer[myid=3D5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader Jan 20 18:40:05 localhost java.net.SocketTimeoutException: Read timed out Then it looks like a shutdown of the ZooKeeperServer and its processors: Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@467] - shutting down I=E2=80=99m not sure if we should see startup logs or not. Then we see the = leader election appear to be successful (I=E2=80=99m not sure if the =E2=80=9CZooK= eeperServer not running=E2=80=9D log is ominous or not): Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot /zookeeper/data/version-2/snapshot.8005dc366 Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id =3D 2, proposed zxid=3D0x8005e12e8 Jan 20 18:40:05 localhost zookeeper: INFO [WorkerReceiver[myid=3D2]:FastLeaderElection@600] - Notification: 1 (messa= ge format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state) ... Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=3D2]:FastLeaderElection@600] - Notification: 1 (messa= ge format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state) Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@168] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /zookeeper/data/version-2 snapdir /zookeeper/data/version-2 Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEAD= ER ELECTION TOOK - 347 Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=3D2]:FastLeaderElection@600] - Notification: 1 (messa= ge format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 4 (n.sid), 0x8 (n.peerEpoch) FOLLOWING (my state) Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] - Resolved hostname: zookeeper-5-internal.prod.airtime.com to address: zookeeper-5-internal.prod.airtime.com/10.1.24.208 However, attempts to connect to the new leader just fail forever: Jan 20 18:40:06 localhost zookeeper: WARN [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=3D0, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888 Jan 20 18:40:07 localhost zookeeper: WARN [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=3D1, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888 Jan 20 18:40:08 localhost zookeeper: WARN [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=3D2, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888 Jan 20 18:40:09 localhost zookeeper: WARN [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=3D3, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888 Then it throws an exception and shuts down: Jan 20 18:40:10 localhost zookeeper: WARN [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader Jan 20 18:40:10 localhost java.net.ConnectException: Connection refused Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called Jan 20 18:40:10 localhost java.lang.Exception: shutdown Follower Jan 20 18:40:10 localhost at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166) Jan 20 18:40:10 localhost at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850) Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] - Shutting down Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id =3D 2, proposed zxid=3D0x8005e12e8 Jan 20 18:40:10 localhost zookeeper: INFO [WorkerReceiver[myid=3D2]:FastLeaderElection@600] - Notification: 1 (messa= ge format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe2 (n.round), LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state) And then the fast leader election fails essentially forever: Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400 Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800 Jan 20 18:40:11 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600 Jan 20 18:40:13 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 3200 Jan 20 18:40:16 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 6400 Jan 20 18:40:22 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 12800 Jan 20 18:40:35 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 25600 Jan 20 18:41:01 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 51200 Jan 20 18:41:52 localhost zookeeper: INFO [QuorumPeer[myid=3D2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 Looking at the node elected as the new leader (zookeeper-5.prod.airtime.com= ), it doesn=E2=80=99t seem like it ever recognizes that it was the leader of t= he original election. I=E2=80=99m not sure if it=E2=80=99s because it can=E2= =80=99t start or something else. Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=3D5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id =3D 5, proposed zxid=3D0x8005e12e8 Jan 20 18:40:05 localhost zookeeper: INFO [WorkerReceiver[myid=3D5]:FastLeaderElection@600] - Notification: 1 (messa= ge format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state) Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=3D5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400 Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=3D5]:FastLeaderElection@600] - Notification: 1 (messa= ge format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state) Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=3D5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800 Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=3D5]:FastLeaderElection@600] - Notification: 1 (messa= ge format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state) Jan 20 18:40:07 localhost zookeeper: INFO [QuorumPeer[myid=3D5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600 A couple of additional notes: 1. Restarting zookeeper on all hosts seemed to resolve the issue. 2. This was easily reproducible for me in 3 different zookeeper clusters in AWS on Jan 20/21. Other things like killing follower hosts worked fin= e. 3. I wasn't able to reproduce this yesterday (Jan 24th), so I wonder if anything network related could make this situation fail, while everythin= g else worked fine. --089e0115f28e7e191d052a2d9887--