hadoop-zookeeper-user mailing list archives

From Austin Shoemaker <aus...@cooliris.com>
Subject Re: Leader election stalled
Date Tue, 02 Sep 2008 22:07:47 GMT
We will retry with the new election algorithm and let you know the  
results.

Thanks for getting back so quickly.

Austin

On Sep 2, 2008, at 10:22 AM, Benjamin Reed wrote:

> I think there is a race condition that is probably easy to get into with
> the old leader election and a large number of servers:
>
> 1) Leader dies
> 2) Followers start looking for a new leader before all Followers have
> abandoned the Leader
> 3) The Followers looking for a new leader see votes of Followers still
> following the (now dead) Leader and start voting for the dead Leader
> 4) The dead Leader gets reelected.
>
> For the old leader election, a server should not vote for another server
> that is not nominating itself.
>
> I'll open a Jira.
>
> ben
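
The rule described above can be sketched as a filter on the vote tally. This is a minimal illustration in Python, not the ZooKeeper implementation: the function names, the vote map, and the tie-break on highest server id are all assumptions. Only the tally counts (7 -> 8, 8 -> 1, 4 -> 1) are chosen to mirror the log quoted further down, reading each tally line as "server id -> number of votes".

```python
from collections import Counter


def naive_winner(votes):
    """Pick the candidate with the most votes.

    votes maps each responding server id to the id it currently votes for.
    Ties are broken by highest id (an assumption made for this sketch).
    """
    counts = Counter(votes.values())
    return max(counts, key=lambda sid: (counts[sid], sid))


def filtered_winner(votes):
    """Same tally, but ignore votes for a candidate that is not nominating itself.

    A dead server never responds, so it has no entry in votes and
    cannot be nominating itself; votes for it are discarded.
    """
    valid = [cand for cand in votes.values() if votes.get(cand) == cand]
    if not valid:
        return None  # nobody is nominating themselves yet
    counts = Counter(valid)
    return max(counts, key=lambda sid: (counts[sid], sid))


# Hypothetical snapshot: server 7 (the old leader) is dead and does not respond.
# Eight servers still report a vote for 7; servers 4 and 8 nominate themselves.
EXAMPLE_VOTES = {1: 7, 2: 7, 3: 7, 4: 4, 5: 7, 6: 7, 8: 8, 9: 7, 10: 7, 11: 7}

if __name__ == "__main__":
    print("naive:", naive_winner(EXAMPLE_VOTES))
    print("filtered:", filtered_winner(EXAMPLE_VOTES))
```

With the naive tally the dead server wins on stale votes; with the filter those votes drop out and only live, self-nominating servers remain as candidates.
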
>
> -----Original Message-----
> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> Sent: Tuesday, September 02, 2008 10:06 AM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Leader election stalled
>
> Hi Austin,
> Did you kill the leader process? It looks like you didn't kill the
> server, since it's responding to ruok. Is that true?
>
> mahadev
>
>
> On 9/2/08 9:56 AM, "Austin Shoemaker" <austin@cooliris.com> wrote:
>
>> Hi,
>>
>> We have run into a situation where killing the leader results in followers
>> perpetually trying to reelect that leader.
>>
>> We have 11 zookeeper (2.2.1 from SF.net) servers and 256 clients connecting
>> at random. We kill the leader and observe the impact, monitoring a script
>> that repeatedly prints the responses to "ruok" and "stat". All servers
>> except the killed leader respond with "imok" and "ZooKeeperServer not
>> running", respectively.
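
A monitoring loop of the kind described above might look roughly like the following. This is a sketch under assumptions, not the script used in the report: the function names, the timeout, and the host/port list passed in are hypothetical, and the client port of each server is not given in this thread. It relies only on the fact that "ruok" and "stat" are sent as plain four-character commands over TCP and that the server closes the connection after replying.

```python
import socket


def four_letter_word(host, port, command, timeout=2.0):
    """Send a four-letter command and return the reply text, or None on failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(command.encode("ascii"))
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # server closed the connection: reply is complete
                    break
                chunks.append(data)
        return b"".join(chunks).decode("utf-8", errors="replace").strip()
    except OSError:
        # Covers connection refused and timeouts alike.
        return None


def poll(servers, commands=("ruok", "stat")):
    """Query every (host, port) pair once and collect the replies per command."""
    return {
        (host, port): {cmd: four_letter_word(host, port, cmd) for cmd in commands}
        for host, port in servers
    }
```

Calling `poll` in a loop with a short sleep and printing the result gives a running view of which servers answer "imok" and what each one reports for "stat"; a killed server shows up as `None` for both.
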
>>
>> About half of the time, each remaining server gets into a loop of failing to
>> connect to the killed leader and then reelecting the killed leader.
>>
>> Here is an example log, which is representative of similar logs on the other
>> servers. We additionally logged connectivity during leader election. If
>> anyone would like complete logs, let me know.
>>
>> Thanks,
>>
>> Austin Shoemaker
>>
>> WARN  - [QuorumPeer:QuorumPeer@397] - FOLLOWING
>> *WARN  - [QuorumPeer:Follower@124] - Following /10.50.65.22:2889*
>> ERROR - [QuorumPeer:Follower@137] - FIXMSG
>> java.net.ConnectException: Connection refused
>> *
>> .... cont'd ....*
>>
>> ERROR - [QuorumPeer:Follower@364] - FIXMSG
>> java.lang.Exception: shutdown Follower
>>        at com.yahoo.zookeeper.server.quorum.Follower.shutdown(Follower.java:364)
>>        at com.yahoo.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:403)
>> WARN  - [QuorumPeer:QuorumPeer@388] - LOOKING
>> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.22:2888
>> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.22:2888
>> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.21:2888
>> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.21:2888
>> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.12:2888
>> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.12:2888
>> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.11:2888
>> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.11:2888
>> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.12:2890
>> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.12:2890
>> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.11:2890
>> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.11:2890
>> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.22:2889
>> *WARN  - [QuorumPeer:LeaderElection@166] - ----> Exception occurred when sending / receiving packet to / from /10.50.65.22:2889
>> java.net.SocketTimeoutException: Receive timed out
>> *WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.21:2890
>> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.21:2890
>> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.21:2889
>> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.21:2889
>> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.12:2889
>> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.12:2889
>> WARN  - [QuorumPeer:LeaderElection@136] - ----> Sending election packet to /10.50.65.11:2889
>> WARN  - [QuorumPeer:LeaderElection@153] - ----> Received response from /10.50.65.11:2889
>> WARN  - [QuorumPeer:LeaderElection@89] - Election tally:
>> WARN  - [QuorumPeer:LeaderElection@95] - 8 -> 1
>> WARN  - [QuorumPeer:LeaderElection@95] - 4 -> 1
>> WARN  - [QuorumPeer:LeaderElection@95] - 7 -> 8
>> WARN  - [QuorumPeer:LeaderElection@97] - ----> Election complete, result.winner = 7
>> *WARN  - [QuorumPeer:LeaderElection@100] - ----> Election complete, address = /10.50.65.22:2889
>> WARN  - [QuorumPeer:QuorumPeer@397] - FOLLOWING
>> WARN  - [QuorumPeer:Follower@124] - Following /10.50.65.22:2889
>> ERROR - [QuorumPeer:Follower@137] - FIXMSG
>> java.net.ConnectException: Connection refused
>> *        at java.net.PlainSocketImpl.socketConnect(Native Method)
>>        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>>        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>>        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>>        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>>        at java.net.Socket.connect(Socket.java:519)
>>        at com.yahoo.zookeeper.server.quorum.Follower.followLeader(Follower.java:133)
>>        at com.yahoo.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:399)
>

