Return-Path: X-Original-To: apmail-zookeeper-user-archive@www.apache.org Delivered-To: apmail-zookeeper-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D2D8718328 for ; Mon, 25 Jan 2016 08:24:30 +0000 (UTC) Received: (qmail 57319 invoked by uid 500); 25 Jan 2016 08:24:30 -0000 Delivered-To: apmail-zookeeper-user-archive@zookeeper.apache.org Received: (qmail 57238 invoked by uid 500); 25 Jan 2016 08:24:29 -0000 Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@zookeeper.apache.org Delivered-To: mailing list user@zookeeper.apache.org Received: (qmail 57227 invoked by uid 99); 25 Jan 2016 08:24:29 -0000 Received: from Unknown (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 25 Jan 2016 08:24:29 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 1904EC115A for ; Mon, 25 Jan 2016 08:24:29 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.978 X-Spam-Level: ** X-Spam-Status: No, score=2.978 tagged_above=-999 required=6.31 tests=[HTML_MESSAGE=3, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001] autolearn=disabled Received: from mx1-us-west.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id 68ta6yvcIkCv for ; Mon, 25 Jan 2016 08:24:27 +0000 (UTC) Received: from DUB004-OMC3S28.hotmail.com (dub004-omc3s28.hotmail.com [157.55.2.37]) by mx1-us-west.apache.org (ASF Mail Server at mx1-us-west.apache.org) with ESMTPS id 3A30320534 for ; Mon, 25 Jan 2016 08:24:26 +0000 (UTC) Received: from DUB116-W30 ([157.55.2.8]) by DUB004-OMC3S28.hotmail.com over TLS secured channel with Microsoft SMTPSVC(7.5.7601.23008); Mon, 25 Jan 2016 00:24:19 -0800 X-TMN: [CUCvDtVkrTCyFNyRRwjujKuf+4uIZSU/] X-Originating-Email: [baruch_w@hotmail.com] Message-ID: Content-Type: multipart/alternative; boundary="_453e69c7-feeb-413d-a72d-781babd0ca68_" From: Bruno Wassermann To: "user@zookeeper.apache.org" Subject: RE: All nodes shutting down simultaneously Date: Mon, 25 Jan 2016 08:24:19 +0000 Importance: Normal In-Reply-To: <760C5ACB-F8DE-4388-BB40-D59F1059013C@apache.org> References: ,<760C5ACB-F8DE-4388-BB40-D59F1059013C@apache.org> MIME-Version: 1.0 X-OriginalArrivalTime: 25 Jan 2016 08:24:19.0121 (UTC) FILETIME=[C941B210:01D15749] --_453e69c7-feeb-413d-a72d-781babd0ca68_ Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Hi Flavio=2C The leader was indeed the server that got powered off (server 4). The configuration of the servers in this cluster is as follows.tickTime=3D2= 000dataDir=3D/var/lib/zookeeperclientPort=3D2181snapCount=3D1000000initLimi= t=3D10syncLimit=3D5autopurge.purgeInterval=3D2autopurge.snapRetainCount=3D4 The nodes were engaged in leader election for almost 90 seconds. I see mess= ages such as the following:2016-01-17 19:32:17=2C081 [myid:1] - INFO [Quor= umPeer[myid=3D1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called....20= 16-01-17 19:32:35=2C920 [myid:1] - INFO [WorkerReceiver[myid=3D1]:FastLead= erElection@597] - Notification: 1 (message format version)=2C 3 (n.leader)= =2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round)=2C LOOKING (n.state)=2C 3 (n.sid)= =2C 0x3 (n.peerEpoch) LOOKING (my state)....2016-01-17 19:33:41=2C709 [myid= :1] - INFO [WorkerReceiver[myid=3D1]:FastLeaderElection@597] - Notificatio= n: 1 (message format version)=2C 5 (n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x= 4 (n.round)=2C LOOKING (n.state)=2C 5 (n.sid)=2C 0x3 (n.peerEpoch) LOOKING = (my state)2016-01-17 19:33:41=2C710 [myid:1] - INFO [WorkerReceiver[myid= =3D1]:FastLeaderElection@597] - Notification: 1 (message format version)=2C= 5 (n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round)=2C LOOKING (n.state)= =2C 3 (n.sid)=2C 0x3 (n.peerEpoch) LOOKING (my state)....2016-01-17 19:33:4= 3=2C602 [myid:1] - INFO [WorkerReceiver[myid=3D1]:FastLeaderElection@597] = - Notification: 1 (message format version)=2C 2 (n.leader)=2C 0x3000f3404 (= n.zxid)=2C 0x4 (n.round)=2C LOOKING (n.state)=2C 2 (n.sid)=2C 0x3 (n.peerEp= och) LOOKING (my state)2016-01-17 19:33:43=2C602 [myid:1] - INFO [WorkerRe= ceiver[myid=3D1]:FastLeaderElection@597] - Notification: 1 (message format = version)=2C 5 (n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round)=2C LOOKIN= G (n.state)=2C 2 (n.sid)=2C 0x3 (n.peerEpoch) LOOKING (my state)....2016-01= -17 19:33:46=2C575 [myid:1] - INFO [WorkerReceiver[myid=3D1]:FastLeaderEle= ction@597] - Notification: 1 (message format version)=2C 1 (n.leader)=2C 0x= 3000f3404 (n.zxid)=2C 0x4 (n.round)=2C LOOKING (n.state)=2C 1 (n.sid)=2C 0x= 3 (n.peerEpoch) LOOKING (my state)2016-01-17 19:33:46=2C575 [myid:1] - INFO= [WorkerReceiver[myid=3D1]:FastLeaderElection@597] - Notification: 1 (mess= age format version)=2C 3 (n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round= )=2C LOOKING (n.state)=2C 1 (n.sid)=2C 0x3 (n.peerEpoch) LOOKING (my state)= 2016-01-17 19:33:46=2C575 [myid:1] - INFO [WorkerReceiver[myid=3D1]:FastLe= aderElection@597] - Notification: 1 (message format version)=2C 5 (n.leader= )=2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round)=2C LOOKING (n.state)=2C 1 (n.sid= )=2C 0x3 (n.peerEpoch) LOOKING (my state)2016-01-17 19:33:46=2C576 [myid:1]= - INFO [WorkerReceiver[myid=3D1]:FastLeaderElection@597] - Notification: = 1 (message format version)=2C 5 (n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x4 (= n.round)=2C FOLLOWING (n.state)=2C 2 (n.sid)=2C 0x4 (n.peerEpoch) LOOKING (= my state)2016-01-17 19:33:46=2C576 [myid:1] - INFO [WorkerReceiver[myid=3D= 1]:FastLeaderElection@597] - Notification: 1 (message format version)=2C 5 = (n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round)=2C FOLLOWING (n.state)= =2C 2 (n.sid)=2C 0x4 (n.peerEpoch) LOOKING (my state)2016-01-17 19:33:46=2C= 576 [myid:1] - INFO [WorkerReceiver[myid=3D1]:FastLeaderElection@597] - No= tification: 1 (message format version)=2C 5 (n.leader)=2C 0x3000f3404 (n.zx= id)=2C 0x4 (n.round)=2C LEADING (n.state)=2C 5 (n.sid)=2C 0x4 (n.peerEpoch)= LOOKING (my state)2016-01-17 19:33:46=2C576 [myid:1] - INFO [WorkerReceiv= er[myid=3D1]:FastLeaderElection@597] - Notification: 1 (message format vers= ion)=2C 5 (n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round)=2C LEADING (n= .state)=2C 5 (n.sid)=2C 0x4 (n.peerEpoch) LOOKING (my state)2016-01-17 19:3= 3:46=2C576 [myid:1] - INFO [WorkerReceiver[myid=3D1]:FastLeaderElection@59= 7] - Notification: 1 (message format version)=2C 5 (n.leader)=2C 0x3000f340= 4 (n.zxid)=2C 0x4 (n.round)=2C FOLLOWING (n.state)=2C 3 (n.sid)=2C 0x4 (n.p= eerEpoch) LOOKING (my state)2016-01-17 19:33:46=2C576 [myid:1] - INFO [Wor= kerReceiver[myid=3D1]:FastLeaderElection@597] - Notification: 1 (message fo= rmat version)=2C 5 (n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round)=2C F= OLLOWING (n.state)=2C 3 (n.sid)=2C 0x4 (n.peerEpoch) LOOKING (my state)2016= -01-17 19:33:46=2C576 [myid:1] - INFO [QuorumPeer[myid=3D1]/0:0:0:0:0:0:0:= 0:2181:QuorumPeer@784] - FOLLOWING2016-01-17 19:33:46=2C576 [myid:1] - INFO= [QuorumPeer[myid=3D1]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEA= DER ELECTION TOOK - 894892016-01-17 19:33:46=2C580 [myid:1] - INFO [Quorum= Peer[myid=3D1]/0:0:0:0:0:0:0:0:2181:Learner@323] - Getting a diff from the = leader 0x4000000192016-01-17 19:33:46=2C580 [myid:1] - WARN [QuorumPeer[my= id=3D1]/0:0:0:0:0:0:0:0:2181:Learner@374] - Got zxid 0x400000001 expected 0= x12016-01-17 19:33:46=2C581 [myid:1] - INFO [QuorumPeer[myid=3D1]/0:0:0:0:= 0:0:0:0:2181:FileTxnSnapLog@240] - Snapshotting: 0x400000019 to /var/lib/zo= okeeper/version-2/snapshot.400000019 At the time of the incident=2C the amount of data held was about 10.2 MB in= about 1=2C800 znodes. The size of the snapshot closest to the time of the = incident was 18 MB and the size of the transaction log on that day was 12 G= B. Should there be something in the log files about the initLimit being rea= ched? If the long election time was really due to a large snapshot=2C we could de= crease the snapCount and increase both the initLimit and syncLimit. Is this= correct?=20 Thanks=2C=20 -- Bruno > Subject: Re: All nodes shutting down simultaneously > From: fpj@apache.org > Date: Sun=2C 24 Jan 2016 14:19:43 +0000 > CC: brunow@il.ibm.com > To: user@zookeeper.apache.org >=20 > Hi Bruno=2C >=20 > If the leader crashes=2C then the followers are supposed to start complai= ning around the same time. Was the server that powered down the leader of t= he ensemble? >=20 > Another leader election shouldn't take two minutes=2C so you could check = the logs for what happened around the time the ensemble wasn't able to get = a leader established. Were election notifications being exchanged regularly= ? Was the size of the snapshot too large and the init limit was kicking in? >=20 > When you say that the servers shut themselves down=2C I suppose you're re= ferring to the fact that you've seen a message in the logs saying it. The s= erver process remained up and they haven't exited=2C yes? >=20 > -Flavio >=20 > > On 24 Jan 2016=2C at 12:05=2C Bruno Wassermann w= rote: > >=20 > > Hi I have recently come across something I have not seen before: all 5 = nodes of a ZK cluster shutting themselves down at around the same time. Al= l servers worked fine for a long time. Ranging from 2016-01-17 19:31 to 19:= 33:xx=2C the servers complained about not being able to follow the leader a= nd the zk processes shut themselves down. One of the servers (actual host a= s in bare-metal) was somehow powered off at 19:33. The ZK cluster was down= for approximately two minutes. Leader election took 90 seconds. The leader= then expired more than 100 client connections. A snippet from the log file= of one of the nodes:2016-01-17 19:32:17=2C081 [myid:1] - WARN [QuorumPeer= [myid=3D1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the= leaderjava.net.SocketTimeoutException: Read timed out at java.net.Socke= tInputStream.socketRead0(Native Method) at java.net.SocketInputStream.so= cketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(= SocketInputStream.java:170) at java.net.SocketInputStream.read(SocketInp= utStream.java:141) at java.io.BufferedInputStream.fill(BufferedInputStre= am.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.jav= a:265) at java.io.DataInputStream.readInt(DataInputStream.java:387) a= t org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) = at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacke= t.java:83) at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputA= rchive.java:103) at org.apache.zookeeper.server.quorum.Learner.readPacke= t(Learner.java:153) at org.apache.zookeeper.server.quorum.Follower.follo= wLeader(Follower.java:85) at org.apache.zookeeper.server.quorum.QuorumPe= er.run(QuorumPeer.java:786)2016-01-17 19:32:17=2C081 [myid:1] - INFO [Quor= umPeer[myid=3D1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown calledjava.l= ang.Exception: shutdown Follower at org.apache.zookeeper.server.quorum.F= ollower.shutdown(Follower.java:166) at org.apache.zookeeper.server.quoru= m.QuorumPeer.run(QuorumPeer.java:790)Followed by closing client connections= . I have analysed garbage collection behaviour on these five nodes and foun= d that on that day two of the nodes underwent one GC collection of 80 ms ea= ch. What does the read time out indicate? Does this mean that there was a = network issue or was there maybe too much data to sync on in a timely manne= r? Why would all nodes shut down at pretty much the same time? All the be= st=2C -- Bruno =20 >=20 = --_453e69c7-feeb-413d-a72d-781babd0ca68_--