zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ken Mamitsuka <...@airtime.com>
Subject Re: Zookeeper leader is terminated in AWS and cluster never recovers
Date Mon, 01 Feb 2016 17:57:49 GMT
Thanks Raul. I haven't been able to repro the issue recently, but I'll get
that info when/if I am able to do so.

On Wed, Jan 27, 2016 at 5:39 PM, Raúl Gutiérrez Segalés <rgs@itevenworks.net
> wrote:

> Can you see traffic on the election and ZAB ports?
>
> You can sniff these messages via https://github.com/twitter/zktraffic.
>
> Maybe jstack the processes and post that as well?
>
> -rgs
> On Jan 27, 2016 10:45 AM, "Ken Mamitsuka" <ken@airtime.com> wrote:
>
> > Thanks Flavio, I uploaded the log files to:
> >
> >
> >
> https://drive.google.com/folderview?id=0B_aqYHmVv9E-LUZJYnlJRDYyNjg&usp=sharing
> >
> > log_follower - log from one of the followers
> > log_new_leader - log from the elected leader (that everyone fails to
> > connect to)
> >
> > The issue is triggered when we killed the original leader at:
> >
> > Jan 20 18:40:05 localhost zookeeper: WARN
> >  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> > following the leader
> >
> > and recovered after i restarted all the zookeepers (monit restart
> > zookeeper-server which just wraps zkServer.sh):
> >
> > Jan 20 18:52:50 zookeeper-2 ansible-service: Invoked with
> > name=zookeeper-server pattern=None enabled=None state=restarted
> sleep=None
> > arguments= runlevel=default
> >
> > I don't believe this is related to ZOOKEEPER-1506 because nothing in DNS
> > changed, we just killed the leader (and the resolved ip address for
> > zookeeper-5.eng.airtime.com is valid).
> >
> > Thanks again for looking at this.
> >
> > On Wed, Jan 27, 2016 at 7:52 AM, Flavio Junqueira <fpj@apache.org>
> wrote:
> >
> > > Perhaps you want to upload your logs somewhere for us to take a look.
> > > There has been issues around name resolution, like this one:
> > >
> > > https://issues.apache.org/jira/browse/ZOOKEEPER-1506 <
> > > https://issues.apache.org/jira/browse/ZOOKEEPER-1506>
> > >
> > > Is there any chance it could be related to it?
> > >
> > > -Flavio
> > >
> > >
> > > > On 26 Jan 2016, at 22:19, Jim Wong <jim@airtime.com> wrote:
> > > >
> > > > Folks,
> > > >
> > > > Does anyone have any thoughts on this one? We can semi-reliably cause
> > > our cluster to get into a seemingly permanent bad stat--at least
> several
> > > minutes, which is longer than we can tolerate in production--just by
> > > terminating the leader. We’ve seen recommendations to change things
> like
> > > the amount of time allotted for the database to be reconstructed from
> the
> > > last snapshot, but none of them have helped.
> > > >
> > > > If nothing else, are there any particular log messages we should be
> > > looking for or troubleshooting strategies we can pursue?
> > > >
> > > >
> > > >
> > > >> On Jan 25, 2016, at 11:30 AM, Ken Mamitsuka <ken@airtime.com>
> wrote:
> > > >>
> > > >> This may be related to the post "All nodes shutting down
> > > simultaneously",
> > > >> but I'm not sure. I was able to reproduce this quite a few times.
I
> > > >> terminated the leader of a 5 node zookeeper cluster (terminated from
> > the
> > > >> AWS console). All the remaining hosts saw the leader disappear:
> > > >>
> > > >> Jan 20 18:40:05 localhost zookeeper: WARN
> > > >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception
> > when
> > > >> following the leader
> > > >> Jan 20 18:40:05 localhost java.net.SocketTimeoutException: Read
> timed
> > > out
> > > >>
> > > >> Then it looks like a shutdown of the ZooKeeperServer and  its
> > > processors:
> > > >>
> > > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@467] -
> > > shutting
> > > >> down
> > > >>
> > > >> I’m not sure if we should see startup logs or not. Then we see the
> > > leader
> > > >> election appear to be successful (I’m not sure if the
> “ZooKeeperServer
> > > not
> > > >> running” log is ominous or not):
> > > >>
> > > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
> > > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading
> > > snapshot
> > > >> /zookeeper/data/version-2/snapshot.8005dc366
> > > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] -
> > New
> > > >> election. My id =  2, proposed zxid=0x8005e12e8
> > > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > > >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> > > (message
> > > >> format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > > >> LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > >> ...
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> > > (message
> > > >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > > >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] -
> FOLLOWING
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@168] -
> > Created
> > > >> server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout
> > 40000
> > > >> datadir /zookeeper/data/version-2 snapdir /zookeeper/data/version-2
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING
-
> > > LEADER
> > > >> ELECTION TOOK - 347
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> > > (message
> > > >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > > >> LOOKING (n.state), 4 (n.sid), 0x8 (n.peerEpoch) FOLLOWING (my state)
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149
> ]
> > -
> > > >> Resolved hostname: zookeeper-5-internal.prod.airtime.com to
> address:
> > > >> zookeeper-5-internal.prod.airtime.com/10.1.24.208
> > > >>
> > > >> However, attempts to connect to the new leader just fail forever:
> > > >>
> > > >> Jan 20 18:40:06 localhost zookeeper: WARN
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> > > >> exception, tries=0, connecting to
> > > >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > > >> Jan 20 18:40:07 localhost zookeeper: WARN
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> > > >> exception, tries=1, connecting to
> > > >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > > >> Jan 20 18:40:08 localhost zookeeper: WARN
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> > > >> exception, tries=2, connecting to
> > > >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > > >> Jan 20 18:40:09 localhost zookeeper: WARN
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> > > >> exception, tries=3, connecting to
> > > >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > > >>
> > > >> Then it throws an exception and shuts down:
> > > >>
> > > >> Jan 20 18:40:10 localhost zookeeper: WARN
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception
> > when
> > > >> following the leader
> > > >> Jan 20 18:40:10 localhost java.net.ConnectException: Connection
> > refused
> > > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown
> > > called
> > > >> Jan 20 18:40:10 localhost java.lang.Exception: shutdown Follower
> > > >> Jan 20 18:40:10 localhost     at
> > > >>
> > org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> > > >> Jan 20 18:40:10 localhost     at
> > > >>
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
> > > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140
> ]
> > -
> > > >> Shutting down
> > > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
> > > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] -
> > New
> > > >> election. My id =  2, proposed zxid=0x8005e12e8
> > > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > > >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> > > (message
> > > >> format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe2 (n.round),
> > > >> LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > >>
> > > >> And then the fast leader election fails essentially forever:
> > > >>
> > > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 400
> > > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 800
> > > >> Jan 20 18:40:11 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 1600
> > > >> Jan 20 18:40:13 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 3200
> > > >> Jan 20 18:40:16 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 6400
> > > >> Jan 20 18:40:22 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 12800
> > > >> Jan 20 18:40:35 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 25600
> > > >> Jan 20 18:41:01 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 51200
> > > >> Jan 20 18:41:52 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 60000
> > > >>
> > > >> Looking at the node elected as the new leader (
> > > zookeeper-5.prod.airtime.com),
> > > >> it doesn’t seem like it ever recognizes that it was the leader of
> the
> > > >> original election. I’m not sure if it’s because it can’t start
or
> > > something
> > > >> else.
> > > >>
> > > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] -
> > New
> > > >> election. My id =  5, proposed zxid=0x8005e12e8
> > > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > > >> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > > (message
> > > >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > > >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 400
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > > (message
> > > >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > > >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 800
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > > (message
> > > >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > > >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > >> Jan 20 18:40:07 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 1600
> > > >>
> > > >> A couple of additional notes:
> > > >>
> > > >>  1. Restarting zookeeper on all hosts seemed to resolve the issue.
> > > >>  2. This was easily reproducible for me in 3 different zookeeper
> > > clusters
> > > >>  in AWS on Jan 20/21. Other things like killing follower hosts
> worked
> > > fine.
> > > >>  3. I wasn't able to reproduce this yesterday (Jan 24th), so I
> wonder
> > if
> > > >>  anything network related could make this situation fail, while
> > > everything
> > > >>  else worked fine.
> > > >
> > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message