hadoop-zookeeper-user mailing list archives

From "Austin Shoemaker" <aus...@cooliris.com>
Subject Leader election stalled
Date Tue, 02 Sep 2008 16:56:57 GMT
Hi,

We have run into a situation where killing the leader results in followers
perpetually trying to reelect that leader.

We have 11 ZooKeeper servers (2.2.1 from SF.net) and 256 clients connecting
at random. We kill the leader and observe the impact using a script that
repeatedly prints each server's responses to "ruok" and "stat". All servers
except the killed leader respond with "imok" and "ZooKeeperServer not
running", respectively.
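
A monitoring loop of this kind can be sketched in Python roughly as follows.
This is a minimal sketch, not the script used here: the function names are
illustrative, and the host and client port in the usage comment are
placeholders (2181 is the usual ZooKeeper default, assumed rather than taken
from this setup). It sends each four-letter command over a fresh TCP
connection and reads until the server closes the socket.

```python
import socket


def four_letter_word(host, port, command, timeout=2.0):
    """Send one four-letter command and return the server's raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def poll(servers, commands=("ruok", "stat")):
    """Query every (host, port) pair once; map (host, port, cmd) -> reply."""
    results = {}
    for host, port in servers:
        for cmd in commands:
            try:
                reply = four_letter_word(host, port, cmd).strip()
            except OSError as exc:  # refused, timed out, unreachable, ...
                reply = f"ERROR: {exc}"
            results[(host, port, cmd)] = reply
    return results


# Usage (placeholder address, assumed client port):
#   import time
#   while True:
#       for key, reply in poll([("192.0.2.10", 2181)]).items():
#           print(key, reply)
#       time.sleep(1)
```

A killed server shows up as an "ERROR: ..." entry rather than stopping the
loop, so the remaining servers keep being polled.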

About half of the time, each remaining server gets stuck in a loop: it fails
to connect to the killed leader, then reelects that same killed leader.

Here is an example log from one server; the logs on the other servers are
similar. We added extra logging of connectivity during leader election. If
anyone would like complete logs, let me know.

Thanks,

Austin Shoemaker

WARN  - [QuorumPeer:QuorumPeer@397] - FOLLOWING
*WARN  - [QuorumPeer:Follower@124] - Following /10.50.65.22:2889*
ERROR - [QuorumPeer:Follower@137] - FIXMSG
java.net.ConnectException: Connection refused
*.... cont'd ....*

ERROR - [QuorumPeer:Follower@364] - FIXMSG
java.lang.Exception: shutdown Follower
        at com.yahoo.zookeeper.server.quorum.Follower.shutdown(Follower.java:364)
        at com.yahoo.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:403)
WARN  - [QuorumPeer:QuorumPeer@388] - LOOKING
WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.22:2888
WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.22:2888
WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.21:2888
WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.21:2888
WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.12:2888
WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.12:2888
WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.11:2888
WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.11:2888
WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.12:2890
WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.12:2890
WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.11:2890
WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.11:2890
WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.22:2889
*WARN  - [QuorumPeer:LeaderElection@166] - ----> Exception occurred when sending / receiving packet to / from /10.50.65.22:2889
java.net.SocketTimeoutException: Receive timed out*
WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.21:2890
WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.21:2890
WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.21:2889
WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.21:2889
WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.12:2889
WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.12:2889
WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.11:2889
WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.11:2889
WARN  - [QuorumPeer:LeaderElection@89] - Election tally:
WARN  - [QuorumPeer:LeaderElection@95] - 8 -> 1
WARN  - [QuorumPeer:LeaderElection@95] - 4 -> 1
WARN  - [QuorumPeer:LeaderElection@95] - 7 -> 8
WARN  - [QuorumPeer:LeaderElection@97] - ----> Election complete, result.winner = 7
*WARN  - [QuorumPeer:LeaderElection@100] - ----> Election complete, address = /10.50.65.22:2889
WARN  - [QuorumPeer:QuorumPeer@397] - FOLLOWING
WARN  - [QuorumPeer:Follower@124] - Following /10.50.65.22:2889
ERROR - [QuorumPeer:Follower@137] - FIXMSG
java.net.ConnectException: Connection refused*
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
        at java.net.Socket.connect(Socket.java:519)
        at com.yahoo.zookeeper.server.quorum.Follower.followLeader(Follower.java:133)
        at com.yahoo.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:399)
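
The pattern in this log (an election round that elects an address which timed
out during that same round) can be checked for mechanically. The following is
a minimal Python sketch under stated assumptions: the function name and
regular expressions are illustrative, they match only the message formats
shown in this log, and wrapped log lines are assumed to have been rejoined so
that each record sits on one line.

```python
import re

# Patterns taken from the message formats in the log above.
SEND_FAIL = re.compile(
    r"Exception occurred when sending / receiving packet to / from /([\d.]+:\d+)"
)
ELECTED = re.compile(r"Election complete, address = /([\d.]+:\d+)")


def elected_unreachable_peer(log_lines):
    """Inspect the first election round found in log_lines.

    Returns the elected address if that address also failed to respond
    earlier in the round, otherwise None.
    """
    failed = set()
    for line in log_lines:
        match = SEND_FAIL.search(line)
        if match:
            failed.add(match.group(1))
            continue
        match = ELECTED.search(line)
        if match:
            winner = match.group(1)
            return winner if winner in failed else None
    return None  # no completed election in these lines
```

Applied to the round shown above, the check would flag 10.50.65.22:2889, the
address that timed out and was then elected.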
