hadoop-zookeeper-user mailing list archives

From Austin Shoemaker <aus...@cooliris.com>
Subject Re: Leader election stalled
Date Fri, 12 Sep 2008 23:51:33 GMT
Ben,

I am able to run algorithm 3 successfully sometimes, though frequently  
the servers deadlock in QuorumCnxManager:initiateConnection on  
s.read(msgBuffer) when reading the challenge from the peer.

Calls to initiateConnection and receiveConnection are synchronized, so  
only one or the other can be executing at a time. This prevents two  
connections from opening between the same pair of servers.

However, it seems that this leads to deadlock, as in this scenario:

A (initiate --> B)
B (initiate --> C)
C (initiate --> A)

initiateConnection can only complete when receiveConnection runs on  
the remote peer and answers the challenge. If all servers are blocked  
in initiateConnection, receiveConnection never runs and leader  
election halts.
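
The cycle above can be reproduced in isolation. The following is a minimal sketch, not the QuorumCnxManager code: the Peer class, the latches, and the timeout are hypothetical, and a ReentrantLock stands in for the monitor shared by the two synchronized methods. Calling receiveConnection on the remote Peer models the remote server needing its own monitor before it can answer the challenge. The real read blocks indefinitely; the timeout is there only so the demonstration terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class ElectionDeadlockSketch {

    /** Hypothetical stand-in for one server's connection manager. */
    static final class Peer {
        // Models the one monitor shared by the synchronized
        // initiateConnection and receiveConnection methods.
        private final ReentrantLock monitor = new ReentrantLock();

        /** Answers a challenge; only runs if this peer's monitor is free. */
        boolean receiveConnection(long timeoutMs) throws InterruptedException {
            if (!monitor.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                return false; // this peer is itself stuck in initiateConnection
            }
            try {
                return true;
            } finally {
                monitor.unlock();
            }
        }

        /** Holds this peer's monitor while waiting for the remote answer. */
        boolean initiateConnection(Peer remote, CountDownLatch allInside,
                                   CountDownLatch allDone, long timeoutMs)
                throws InterruptedException {
            monitor.lock();
            try {
                allInside.countDown();
                allInside.await(); // force the worst case: every peer is initiating
                boolean answered = remote.receiveConnection(timeoutMs);
                allDone.countDown();
                allDone.await(); // keep the monitor until every attempt has ended
                return answered;
            } finally {
                monitor.unlock();
            }
        }
    }

    /** Runs the A -> B -> C -> A cycle and counts connections that complete. */
    static int completedConnections(long timeoutMs) {
        Peer[] peers = { new Peer(), new Peer(), new Peer() };
        CountDownLatch allInside = new CountDownLatch(peers.length);
        CountDownLatch allDone = new CountDownLatch(peers.length);
        AtomicInteger completed = new AtomicInteger();
        Thread[] threads = new Thread[peers.length];

        for (int i = 0; i < peers.length; i++) {
            final Peer self = peers[i];
            final Peer remote = peers[(i + 1) % peers.length];
            threads[i] = new Thread(() -> {
                try {
                    if (self.initiateConnection(remote, allInside, allDone, timeoutMs)) {
                        completed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed connections: " + completedConnections(200));
    }
}
```

Every attempt times out because each peer's monitor is held by its own pending initiateConnection, so no connection completes. Without the timeout, all three threads would wait forever, which matches the stall described above.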

Looking forward to your thoughts.

Thanks,

Austin

On Sep 2, 2008, at 10:14 AM, Benjamin Reed wrote:

> Austin,
>
> Could you try using the new leader election algorithm? You need to set
> the algorithm type to 3 and you also need to set the election port (TCP)
> to be used.
>
> See http://zookeeper.wiki.sourceforge.net/ZooKeeperConfiguration for
> more details.
>
> ben
>
> -----Original Message-----
> From: Austin Shoemaker [mailto:austin@cooliris.com]
> Sent: Tuesday, September 02, 2008 9:57 AM
> To: zookeeper-user@hadoop.apache.org
> Subject: Leader election stalled
>
> Hi,
>
> We have run into a situation where killing the leader results in
> followers perpetually trying to reelect that leader.
>
> We have 11 zookeeper (2.2.1 from SF.net) servers and 256 clients
> connecting at random. We kill the leader and observe the impact,
> monitoring a script that repeatedly prints the responses to "ruok" and
> "stat". All servers except the killed leader respond with "imok" and
> "ZooKeeperServer not running", respectively.
>
> About half of the time, each remaining server gets into a loop of
> failing to connect to the killed leader and then reelecting the killed
> leader.
>
> Here is an example log, which is representative of similar logs on the
> other servers. We additionally logged connectivity during leader
> election. If anyone would like complete logs, let me know.
>
> Thanks,
>
> Austin Shoemaker
>
> WARN  - [QuorumPeer:QuorumPeer@397] - FOLLOWING
> *WARN  - [QuorumPeer:Follower@124] - Following /10.50.65.22:2889*
> ERROR - [QuorumPeer:Follower@137] - FIXMSG
> java.net.ConnectException: Connection refused
> *
> .... cont'd ....*
>
> ERROR - [QuorumPeer:Follower@364] - FIXMSG
> java.lang.Exception: shutdown Follower
>        at com.yahoo.zookeeper.server.quorum.Follower.shutdown(Follower.java:364)
>        at com.yahoo.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:403)
> WARN  - [QuorumPeer:QuorumPeer@388] - LOOKING
> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.22:2888
> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.22:2888
> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.21:2888
> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.21:2888
> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.12:2888
> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.12:2888
> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.11:2888
> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.11:2888
> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.12:2890
> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.12:2890
> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.11:2890
> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.11:2890
> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.22:2889
> *WARN  - [QuorumPeer:LeaderElection@166] - ----> Exception occurred when sending / receiving packet to / from /10.50.65.22:2889
> java.net.SocketTimeoutException: Receive timed out*
> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.21:2890
> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.21:2890
> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.21:2889
> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.21:2889
> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.12:2889
> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.12:2889
> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.11:2889
> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.11:2889
> WARN  - [QuorumPeer:LeaderElection@89] - Election tally:
> WARN  - [QuorumPeer:LeaderElection@95] - 8 -> 1
> WARN  - [QuorumPeer:LeaderElection@95] - 4 -> 1
> WARN  - [QuorumPeer:LeaderElection@95] - 7 -> 8
> WARN  - [QuorumPeer:LeaderElection@97] - ----> Election complete, result.winner = 7
> *WARN  - [QuorumPeer:LeaderElection@100] - ----> Election complete, address = /10.50.65.22:2889
> WARN  - [QuorumPeer:QuorumPeer@397] - FOLLOWING
> WARN  - [QuorumPeer:Follower@124] - Following /10.50.65.22:2889
> ERROR - [QuorumPeer:Follower@137] - FIXMSG
> java.net.ConnectException: Connection refused
> *        at java.net.PlainSocketImpl.socketConnect(Native Method)
>        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>        at java.net.Socket.connect(Socket.java:519)
>        at com.yahoo.zookeeper.server.quorum.Follower.followLeader(Follower.java:133)
>        at com.yahoo.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:399)

