zookeeper-user mailing list archives

From Nick Bailey <ni...@mailtrust.com>
Subject Re: Killing a zookeeper server
Date Wed, 13 Jan 2010 19:19:26 GMT
So the solution for us was to just nuke zookeeper and restart
everywhere. We will also be upgrading soon.

To answer your question: yes, I believe all the servers were running
normally except that they were experiencing high CPU usage. As we
began to see CPU alerts, I started restarting some of the servers.

It was then that we noticed that they were not actually running  
according to 'stat'.
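
For reference, this is roughly how one can poke a server with the
four-letter commands (a minimal sketch; the host, port, and command
are placeholders):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.Socket;

    // Send a four-letter command ("ruok", "stat", ...) to a ZooKeeper
    // server and print whatever it returns before closing the socket.
    public class FourLetterProbe {
        public static void main(String[] args) throws IOException {
            String host = args.length > 0 ? args[0] : "localhost";
            String cmd = args.length > 1 ? args[1] : "stat";
            try (Socket s = new Socket(host, 2181)) {
                s.getOutputStream().write(cmd.getBytes());
                s.getOutputStream().flush();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(s.getInputStream()));
                for (String line; (line = in.readLine()) != null; ) {
                    System.out.println(line);
                }
            }
        }
    }

The shell equivalent is just: echo stat | nc <host> 2181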

I still have the log from one server at debug level and the rest at
warn level. If you would like to see and analyze any of these, just
let me know.
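
For reference, the debug-level server just has the log4j root logger
turned up. Illustrative log4j.properties lines (assuming the stock
appender names; ours may differ slightly):

    log4j.rootLogger=DEBUG, ROLLINGFILE
    log4j.appender.ROLLINGFILE=org.apache.log4j.RollingFileAppender
    log4j.appender.ROLLINGFILE.File=zookeeper.log
    log4j.appender.ROLLINGFILE.layout=org.apache.log4j.PatternLayout
    log4j.appender.ROLLINGFILE.layout.ConversionPattern=%d [%t] (%C{1}) %p - %m%n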

Thanks for the help,
Nick Bailey

On Jan 12, 2010, at 8:20 PM, Patrick Hunt <phunt@apache.org> wrote:

> Nick Bailey wrote:
>> In my last email I failed to include a log line that may be
>> relevant as well:
>> 2010-01-12 18:33:10,658 [QuorumPeer:/0.0.0.0:2181] (QuorumCnxManager) DEBUG - Queue size: 0
>> 2010-01-12 18:33:10,659 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification time out: 6400
>
> Yes, that is significant/interesting. I believe this means that
> there is some problem with the election process (i.e. the server
> re-joining the ensemble). We have a backoff on these attempts, which
> matches your description below. We have fixed some election issues
> in recent versions (we introduced fault injection testing prior to
> the 3.2.1 release, which found a few issues with election). I don't
> have the details offhand, but I've asked Flavio to comment directly
> (he's in a different time zone).
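>
> To make that concrete: the backoff is essentially a doubling wait
> with a cap (a rough sketch of the shape, not the actual ZooKeeper
> source; the start value is an assumption, but the 6400 in your log
> is consistent with several doubled rounds, and the cap matches the
> 60000 you see later):
>
>     // Illustrative only: print a doubling notification timeout,
>     // capped at 60s, as a model of the election retry backoff.
>     public class BackoffDemo {
>         public static void main(String[] args) {
>             final int max = 60000;  // cap, in ms
>             int timeout = 200;      // assumed initial wait, in ms
>             while (timeout < max) {
>                 System.out.println("notification timeout: " + timeout);
>                 timeout = Math.min(timeout * 2, max);
>             }
>             System.out.println("notification timeout: " + timeout);
>         }
>     }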
>
> Can you provide a bit more background: prior to this issue, this
> particular server was running fine? You restarted it and then
> started seeing the issue (rather than this being a new server, I
> mean)? What I'm getting at is that there shouldn't/couldn't be any
> networking/firewall type issue going on, right?
>
> Can you provide a full/more complete log? What I'd suggest: shut
> down this one server, clear the log4j log file, then restart it. Let
> the problem reproduce, then gzip the log4j log file and attach it to
> your response. OK?
>
> Patrick
>
>> We see this line occur frequently, and the timeout will gradually
>> increase to 60000. It appears that all of our servers that seem to
>> be acting normally are experiencing the CPU issue I mentioned
>> earlier, 'https://issues.apache.org/jira/browse/ZOOKEEPER-427'.
>> Perhaps that is causing the servers to time out when responding?
>> Also, to answer your other questions, Patrick: we aren't storing a
>> large amount of data, and network latency appears fine.
>> Thanks for the help,
>> Nick
>> -----Original Message-----
>> From: "Nick Bailey" <nicholas.bailey@rackspace.com>
>> Sent: Tuesday, January 12, 2010 6:03pm
>> To: zookeeper-user@hadoop.apache.org
>> Subject: Re: Killing a zookeeper server
>> 12 was just to keep uniformity on our servers; our clients are
>> connecting from the same 12 servers. That's easily modifiable, and
>> perhaps we should look into changing it.
>> The logs just seem to indicate that the servers that claim to have
>> no server running are continually attempting to elect a leader. A
>> sample is provided below. The initial exception is something we see
>> regularly in our logs, and the DEBUG and INFO lines that follow
>> simply repeat throughout the log.
>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) WARN  - Exception causing close of session 0x0 due to java.io.IOException: Read error
>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - IOException stack trace
>> java.io.IOException: Read error
>>        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
>>        at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) INFO  - closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/172.20.36.9:2181 remote=/172.20.36.9:50367]
>> 2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - ignoring exception during input shutdown
>> java.net.SocketException: Transport endpoint is not connected
>>        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
>>        at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
>>        at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>>        at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:767)
>>        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:421)
>>        at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding vote
>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 9
>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding vote
>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 11
>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding vote
>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding vote
>> 2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>> 2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 8
>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding vote
>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding vote
>> 2010-01-12 17:57:52,200 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>> 2010-01-12 17:57:52,201 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>> -----Original Message-----
>> From: "Patrick Hunt" <phunt@apache.org>
>> Sent: Tuesday, January 12, 2010 5:40pm
>> To: zookeeper-user@hadoop.apache.org, nicholas.bailey@rackspace.com
>> Subject: Re: Killing a zookeeper server
>> 12 servers? That's a lot - if you don't mind my asking, why so
>> many? Typically we recommend 5; that way you can have one server
>> down for maintenance and still tolerate a failure without bringing
>> down the cluster.
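>>
>> The arithmetic behind that, if it helps: writes need acks from a
>> strict majority, so an ensemble of N servers tolerates
>> floor((N-1)/2) failures. A trivial sketch:
>>
>>     // Majority-quorum arithmetic for a few ensemble sizes.
>>     public class QuorumMath {
>>         public static void main(String[] args) {
>>             for (int n : new int[] {3, 5, 12}) {
>>                 System.out.println(n + " servers: majority = "
>>                         + (n / 2 + 1) + ", tolerates "
>>                         + ((n - 1) / 2) + " failures");
>>             }
>>         }
>>     }
>>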
>> The "electing a leader" is probably the restarted machine  
>> attempting to re-join the ensemble (it should join as a follower if  
>> you have a leader already elected, given that it's xid is behind  
>> the existing leader.) Hard to tell though without the logs.
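>>
>> (Background on that comparison: a zxid is a single 64-bit number,
>> with the leader's epoch in the high 32 bits and a counter in the
>> low 32, so "behind" is an ordinary numeric comparison. A trivial
>> sketch, using an example value:)
>>
>>     // Decode a zxid into its epoch and counter halves.
>>     public class ZxidDecode {
>>         public static void main(String[] args) {
>>             long zxid = 30064826826L;          // example zxid
>>             long epoch = zxid >>> 32;          // -> 7
>>             long counter = zxid & 0xffffffffL; // -> 55754
>>             System.out.println("epoch=" + epoch
>>                     + ", counter=" + counter);
>>         }
>>     }
>>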
>> You might also be seeing initLimit exceeded - is the data you are
>> storing in ZK large? Or perhaps network connectivity is slow?
>> http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
>> Again, the logs would give some insight on this.
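>>
>> For reference, initLimit is expressed in ticks. Illustrative
>> zoo.cfg values (examples only, not a recommendation):
>>
>>     tickTime=2000   # length of one tick, in ms
>>     initLimit=10    # followers get 10 ticks (20s) to connect and sync
>>     syncLimit=5     # followers may lag at most 5 ticks (10s)
>>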
>> Patrick
>> Nick Bailey wrote:
>>> We are running zookeeper 3.1.0
>>>
>>> Recently we noticed the CPU usage on our machines becoming
>>> increasingly high, and we believe the cause is
>>>
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-427
>>>
>>> However, our solution when we noticed the problem was to kill the
>>> zookeeper process and restart it.
>>>
>>> After doing that, though, it looks like the newly restarted
>>> zookeeper server is continually attempting to elect a leader even
>>> though one already exists.
>>>
>>> The process responds with 'imok' when asked, but the stat command
>>> returns 'ZooKeeperServer not running'.
>>>
>>> I believe that killing the current leader should trigger all
>>> servers to do an election and solve the problem, but I'm not sure.
>>> Should that be the course of action in this situation?
>>>
>>> Also, we have 12 servers, but 5 are currently not running
>>> according to stat. So I guess this isn't a problem unless we lose
>>> another one. We have plans to upgrade zookeeper to solve the CPU
>>> issue but haven't been able to do that yet.
>>>
>>> Any help appreciated, Nick Bailey
>>>
