Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@zookeeper.apache.org
Message-ID: <DUB116-W30F3233E7681C53FF12038E8C70@phx.gbl>
Content-Type: multipart/alternative;
	boundary="_453e69c7-feeb-413d-a72d-781babd0ca68_"
From: Bruno Wassermann <baruch_w@hotmail.com>
To: "user@zookeeper.apache.org" <user@zookeeper.apache.org>
Subject: RE: All nodes shutting down simultaneously
Date: Mon, 25 Jan 2016 08:24:19 +0000
Importance: Normal
In-Reply-To: <760C5ACB-F8DE-4388-BB40-D59F1059013C@apache.org>
References: 
 <DUB116-W31B34B8498D30944FAE238E8C60@phx.gbl>,<760C5ACB-F8DE-4388-BB40-D59F1059013C@apache.org>
MIME-Version: 1.0

--_453e69c7-feeb-413d-a72d-781babd0ca68_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Hi Flavio=2C
The leader was indeed the server that got powered off (server 4).
The configuration of the servers in this cluster is as follows.tickTime=3D2=
000dataDir=3D/var/lib/zookeeperclientPort=3D2181snapCount=3D1000000initLimi=
t=3D10syncLimit=3D5autopurge.purgeInterval=3D2autopurge.snapRetainCount=3D4
The nodes were engaged in leader election for almost 90 seconds. I see mess=
ages such as the following:2016-01-17 19:32:17=2C081 [myid:1] - INFO  [Quor=
umPeer[myid=3D1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called....20=
16-01-17 19:32:35=2C920 [myid:1] - INFO  [WorkerReceiver[myid=3D1]:FastLead=
erElection@597] - Notification: 1 (message format version)=2C 3 (n.leader)=
=2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round)=2C LOOKING (n.state)=2C 3 (n.sid)=
=2C 0x3 (n.peerEpoch) LOOKING (my state)....2016-01-17 19:33:41=2C709 [myid=
:1] - INFO  [WorkerReceiver[myid=3D1]:FastLeaderElection@597] - Notificatio=
n: 1 (message format version)=2C 5 (n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x=
4 (n.round)=2C LOOKING (n.state)=2C 5 (n.sid)=2C 0x3 (n.peerEpoch) LOOKING =
(my state)2016-01-17 19:33:41=2C710 [myid:1] - INFO  [WorkerReceiver[myid=
=3D1]:FastLeaderElection@597] - Notification: 1 (message format version)=2C=
 5 (n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round)=2C LOOKING (n.state)=
=2C 3 (n.sid)=2C 0x3 (n.peerEpoch) LOOKING (my state)....2016-01-17 19:33:4=
3=2C602 [myid:1] - INFO  [WorkerReceiver[myid=3D1]:FastLeaderElection@597] =
- Notification: 1 (message format version)=2C 2 (n.leader)=2C 0x3000f3404 (=
n.zxid)=2C 0x4 (n.round)=2C LOOKING (n.state)=2C 2 (n.sid)=2C 0x3 (n.peerEp=
och) LOOKING (my state)2016-01-17 19:33:43=2C602 [myid:1] - INFO  [WorkerRe=
ceiver[myid=3D1]:FastLeaderElection@597] - Notification: 1 (message format =
version)=2C 5 (n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round)=2C LOOKIN=
G (n.state)=2C 2 (n.sid)=2C 0x3 (n.peerEpoch) LOOKING (my state)....2016-01=
-17 19:33:46=2C575 [myid:1] - INFO  [WorkerReceiver[myid=3D1]:FastLeaderEle=
ction@597] - Notification: 1 (message format version)=2C 1 (n.leader)=2C 0x=
3000f3404 (n.zxid)=2C 0x4 (n.round)=2C LOOKING (n.state)=2C 1 (n.sid)=2C 0x=
3 (n.peerEpoch) LOOKING (my state)2016-01-17 19:33:46=2C575 [myid:1] - INFO=
  [WorkerReceiver[myid=3D1]:FastLeaderElection@597] - Notification: 1 (mess=
age format version)=2C 3 (n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round=
)=2C LOOKING (n.state)=2C 1 (n.sid)=2C 0x3 (n.peerEpoch) LOOKING (my state)=
2016-01-17 19:33:46=2C575 [myid:1] - INFO  [WorkerReceiver[myid=3D1]:FastLe=
aderElection@597] - Notification: 1 (message format version)=2C 5 (n.leader=
)=2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round)=2C LOOKING (n.state)=2C 1 (n.sid=
)=2C 0x3 (n.peerEpoch) LOOKING (my state)2016-01-17 19:33:46=2C576 [myid:1]=
 - INFO  [WorkerReceiver[myid=3D1]:FastLeaderElection@597] - Notification: =
1 (message format version)=2C 5 (n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x4 (=
n.round)=2C FOLLOWING (n.state)=2C 2 (n.sid)=2C 0x4 (n.peerEpoch) LOOKING (=
my state)2016-01-17 19:33:46=2C576 [myid:1] - INFO  [WorkerReceiver[myid=3D=
1]:FastLeaderElection@597] - Notification: 1 (message format version)=2C 5 =
(n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round)=2C FOLLOWING (n.state)=
=2C 2 (n.sid)=2C 0x4 (n.peerEpoch) LOOKING (my state)2016-01-17 19:33:46=2C=
576 [myid:1] - INFO  [WorkerReceiver[myid=3D1]:FastLeaderElection@597] - No=
tification: 1 (message format version)=2C 5 (n.leader)=2C 0x3000f3404 (n.zx=
id)=2C 0x4 (n.round)=2C LEADING (n.state)=2C 5 (n.sid)=2C 0x4 (n.peerEpoch)=
 LOOKING (my state)2016-01-17 19:33:46=2C576 [myid:1] - INFO  [WorkerReceiv=
er[myid=3D1]:FastLeaderElection@597] - Notification: 1 (message format vers=
ion)=2C 5 (n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round)=2C LEADING (n=
.state)=2C 5 (n.sid)=2C 0x4 (n.peerEpoch) LOOKING (my state)2016-01-17 19:3=
3:46=2C576 [myid:1] - INFO  [WorkerReceiver[myid=3D1]:FastLeaderElection@59=
7] - Notification: 1 (message format version)=2C 5 (n.leader)=2C 0x3000f340=
4 (n.zxid)=2C 0x4 (n.round)=2C FOLLOWING (n.state)=2C 3 (n.sid)=2C 0x4 (n.p=
eerEpoch) LOOKING (my state)2016-01-17 19:33:46=2C576 [myid:1] - INFO  [Wor=
kerReceiver[myid=3D1]:FastLeaderElection@597] - Notification: 1 (message fo=
rmat version)=2C 5 (n.leader)=2C 0x3000f3404 (n.zxid)=2C 0x4 (n.round)=2C F=
OLLOWING (n.state)=2C 3 (n.sid)=2C 0x4 (n.peerEpoch) LOOKING (my state)2016=
-01-17 19:33:46=2C576 [myid:1] - INFO  [QuorumPeer[myid=3D1]/0:0:0:0:0:0:0:=
0:2181:QuorumPeer@784] - FOLLOWING2016-01-17 19:33:46=2C576 [myid:1] - INFO=
  [QuorumPeer[myid=3D1]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEA=
DER ELECTION TOOK - 894892016-01-17 19:33:46=2C580 [myid:1] - INFO  [Quorum=
Peer[myid=3D1]/0:0:0:0:0:0:0:0:2181:Learner@323] - Getting a diff from the =
leader 0x4000000192016-01-17 19:33:46=2C580 [myid:1] - WARN  [QuorumPeer[my=
id=3D1]/0:0:0:0:0:0:0:0:2181:Learner@374] - Got zxid 0x400000001 expected 0=
x12016-01-17 19:33:46=2C581 [myid:1] - INFO  [QuorumPeer[myid=3D1]/0:0:0:0:=
0:0:0:0:2181:FileTxnSnapLog@240] - Snapshotting: 0x400000019 to /var/lib/zo=
okeeper/version-2/snapshot.400000019

At the time of the incident=2C the amount of data held was about 10.2 MB in=
 about 1=2C800 znodes. The size of the snapshot closest to the time of the =
incident was 18 MB and the size of the transaction log on that day was 12 G=
B. Should there be something in the log files about the initLimit being rea=
ched?
If the long election time was really due to a large snapshot=2C we could de=
crease the snapCount and increase both the initLimit and syncLimit. Is this=
 correct?=20
Thanks=2C=20
-- Bruno
> Subject: Re: All nodes shutting down simultaneously
> From: fpj@apache.org
> Date: Sun=2C 24 Jan 2016 14:19:43 +0000
> CC: brunow@il.ibm.com
> To: user@zookeeper.apache.org
>=20
> Hi Bruno=2C
>=20
> If the leader crashes=2C then the followers are supposed to start complai=
ning around the same time. Was the server that powered down the leader of t=
he ensemble?
>=20
> Another leader election shouldn't take two minutes=2C so you could check =
the logs for what happened around the time the ensemble wasn't able to get =
a leader established. Were election notifications being exchanged regularly=
? Was the size of the snapshot too large and the init limit was kicking in?
>=20
> When you say that the servers shut themselves down=2C I suppose you're re=
ferring to the fact that you've seen a message in the logs saying it. The s=
erver process remained up and they haven't exited=2C yes?
>=20
> -Flavio
>=20
> > On 24 Jan 2016=2C at 12:05=2C Bruno Wassermann <baruch_w@hotmail.com> w=
rote:
> >=20
> > Hi I have recently come across something I have not seen before: all 5 =
nodes of a ZK cluster shutting themselves down at around the same time.  Al=
l servers worked fine for a long time. Ranging from 2016-01-17 19:31 to 19:=
33:xx=2C the servers complained about not being able to follow the leader a=
nd the zk processes shut themselves down. One of the servers (actual host a=
s in bare-metal) was somehow powered off at 19:33.  The ZK cluster was down=
 for approximately two minutes. Leader election took 90 seconds. The leader=
 then expired more than 100 client connections. A snippet from the log file=
 of one of the nodes:2016-01-17 19:32:17=2C081 [myid:1] - WARN  [QuorumPeer=
[myid=3D1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the=
 leaderjava.net.SocketTimeoutException: Read timed out    at java.net.Socke=
tInputStream.socketRead0(Native Method)    at java.net.SocketInputStream.so=
cketRead(SocketInputStream.java:116)    at java.net.SocketInputStream.read(=
SocketInputStream.java:170)    at java.net.SocketInputStream.read(SocketInp=
utStream.java:141)    at java.io.BufferedInputStream.fill(BufferedInputStre=
am.java:246)    at java.io.BufferedInputStream.read(BufferedInputStream.jav=
a:265)    at java.io.DataInputStream.readInt(DataInputStream.java:387)    a=
t org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)   =
 at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacke=
t.java:83)    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputA=
rchive.java:103)    at org.apache.zookeeper.server.quorum.Learner.readPacke=
t(Learner.java:153)    at org.apache.zookeeper.server.quorum.Follower.follo=
wLeader(Follower.java:85)    at org.apache.zookeeper.server.quorum.QuorumPe=
er.run(QuorumPeer.java:786)2016-01-17 19:32:17=2C081 [myid:1] - INFO  [Quor=
umPeer[myid=3D1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown calledjava.l=
ang.Exception: shutdown Follower    at org.apache.zookeeper.server.quorum.F=
ollower.shutdown(Follower.java:166)    at org.apache.zookeeper.server.quoru=
m.QuorumPeer.run(QuorumPeer.java:790)Followed by closing client connections=
. I have analysed garbage collection behaviour on these five nodes and foun=
d that on that day two of the nodes underwent one GC collection of 80 ms ea=
ch.  What does the read time out indicate? Does this mean that there was a =
network issue or was there maybe too much data to sync on in a timely manne=
r? Why would all nodes shut down at pretty much the same time?   All the be=
st=2C  -- Bruno 		 	   		 =20
>=20
 		 	   		  =

--_453e69c7-feeb-413d-a72d-781babd0ca68_--