hadoop-zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick Hunt <ph...@apache.org>
Subject Re: Killing a zookeeper server
Date Wed, 13 Jan 2010 01:20:58 GMT
Nick Bailey wrote:
> In my last email I failded to include a log line that may be revelent as well
> 
> 2010-01-12 18:33:10,658 [QuorumPeer:/0.0.0.0:2181] (QuorumCnxManager) DEBUG - Queue size:
0
> 2010-01-12 18:33:10,659 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification
time out: 6400

Yes, that is significant/interesting. I believe this means that there is 
some problem with the election process (ie the server re-joining the 
ensemble). We have a backoff on these attempts, which matches your 
description below. We have fixed some election issues in recent versions 
(we introduced fault injection testing prior to the 3.2.1 release which 
found a few issues with election). I don't have them off hand - but I've 
asked Flavio to comment directly (he's in diff tz).

Can you provide a bit more background: prior to this issue, this 
particular server was running fine? You restarted it and then started 
seeing the issue? (rather than this being a new server I mean). What I'm 
getting at is that there shouldn't/couldn't be any networking/firewall 
type issue going on right?

Can you provide a full/more log? What I'd suggest is shut down this one 
server, clear the log4j log file, then restart it. Let the problem 
reproduce, then gzip the log4j log file and attach to your response. Ok?

Patrick

> 
> We see this line occur frequently and the timeout will graduatlly increase to 60000.
It appears that all of our servers that seem to be acting normally are experiencing the cpu
issue I mentioned earlier 'https://issues.apache.org/jira/browse/ZOOKEEPER-427'. Perhaps that
is causing the timeout in responding?
> 
> Also to answer your other questions Patrick, we aren't storing a large amount of data
really and network latency appears fine.
> 
> Thanks for the help,
> Nick
> 
> 
> -----Original Message-----
> From: "Nick Bailey" <nicholas.bailey@rackspace.com>
> Sent: Tuesday, January 12, 2010 6:03pm
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Killing a zookeeper server
> 
> 12 was just to keep uniformity on our servers. Our clients are connecting from the same
12 servers.  Easily modifiable and perhaps we should look into changing that.
> 
> 
> The logs just seem to indicate that the servers that claim to have no server running
are continually attempting to elect a leader.  A sample is provided below.  The initial exception
is something we see regularly in our logs and the debug and info lines following are simply
repeating throughout the log.
> 
> 
> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) WARN  - Exception
causing close of session 0x0 due to java.io.IOException: Read error
> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - IOException
stack trace
> java.io.IOException: Read error
>         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
>         at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) INFO  - closing session:0x0
NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/172.20.36.9:2181 remote=/172.20.36.9:50367]
> 2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - ignoring
exception during input shutdown
> java.net.SocketException: Transport endpoint is not connected
>         at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
>         at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
>         at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>         at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:767)
>         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:421)
>         at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
> 
> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding
vote
> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification:
3, 30064826826, 1, 9, LOOKING, LOOKING, 9
> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3,
proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding
vote
> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification:
3, 30064826826, 1, 9, LOOKING, LOOKING, 11
> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3,
proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding
vote
> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification:
3, 30064826826, 1, 9, LOOKING, LOOKING, 12
> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3,
proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding
vote
> 2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver)
DEBUG - Receive new message.
> 2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver)
DEBUG - Receive new message.
> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification:
3, 30064826826, 1, 9, LOOKING, LOOKING, 8
> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3,
proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding
vote
> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification:
3, 30064826826, 1, 9, LOOKING, LOOKING, 12
> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3,
proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding
vote
> 2010-01-12 17:57:52,200 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver)
DEBUG - Receive new message.
> 2010-01-12 17:57:52,201 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver)
DEBUG - Receive new message.
> 
> 
> -----Original Message-----
> From: "Patrick Hunt" <phunt@apache.org>
> Sent: Tuesday, January 12, 2010 5:40pm
> To: zookeeper-user@hadoop.apache.org, nicholas.bailey@rackspace.com
> Subject: Re: Killing a zookeeper server
> 
> 12 servers? That's alot, if you dont' mind my asking why so many? 
> Typically we recommend 5 - that way you can have one down for 
> maintenance and still have a failure that doesn't bring down the cluster.
> 
> The "electing a leader" is probably the restarted machine attempting to 
> re-join the ensemble (it should join as a follower if you have a leader 
> already elected, given that it's xid is behind the existing leader.) 
> Hard to tell though without the logs.
> 
> You might also be seeing the initLimit exceeded, is the data you are 
> storing in ZK large? Or perhaps network connectivity is slow?
> http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
> again the logs would give some insight on this.
> 
> 
> Patrick
> 
> Nick Bailey wrote:
>> We are running zookeeper 3.1.0
>>
>> Recently we noticed the cpu usage on our machines becoming
>> increasingly high and we believe the cause is
>>
>> https://issues.apache.org/jira/browse/ZOOKEEPER-427
>>
>> However our solution when we noticed the problem was to kill the
>> zookeeper process and restart it.
>>
>> After doing that though it looks like the newly restarted zookeeper
>> server is continually attempting to elect a leader even though one
>> already exists.
>>
>> The process responses with 'imok' when asked, but the stat command
>> returns 'ZooKeeperServer not running'.
>>
>> I belive that killing the current leader should trigger all servers
>> to do an election and solve the problem, but I'm not sure. Should
>> that be the course of action in this situation?
>>
>> Also we have 12 servers, but 5 are currently not running according to
>> stat.  So I guess this isn't a problem unless we lose another one.
>> We have plans to upgrade zookeeper to solve the cpu issue but haven't
>> been able to do that yet.
>>
>> Any help appreciated, Nick Bailey
>>
> 
> 
> 
> 

Mime
View raw message