So the solution for us was to just nuke zookeeper and restart it
everywhere. We will also be upgrading soon.
To answer your question: yes, I believe all the servers were running
normally except that they were experiencing high CPU usage. As we
began to see CPU alerts, I started restarting some of the servers.
It was then that we noticed that they were not actually running
according to 'stat'.
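In case it helps, here is a minimal sketch of the kind of probe I mean
by asking 'ruok'/'stat': it just writes one of ZooKeeper's four-letter
commands to the client port and prints the reply (the host and port
below are placeholders for our setup):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;

// Minimal four-letter-word probe: write "ruok"/"stat" to the client port and
// print whatever the server sends back before closing the connection.
public class ZkProbe {
    static String fourLetter(String host, int port, String cmd) throws IOException {
        Socket sock = new Socket(host, port);
        try {
            sock.getOutputStream().write(cmd.getBytes("US-ASCII"));
            sock.getOutputStream().flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(sock.getInputStream(), "US-ASCII"));
            StringBuilder reply = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                reply.append(line).append('\n');
            }
            return reply.toString();
        } finally {
            sock.close();
        }
    }

    public static void main(String[] args) throws IOException {
        String host = args.length > 0 ? args[0] : "localhost";  // placeholder host
        System.out.print(fourLetter(host, 2181, "ruok"));  // a healthy server answers "imok"
        System.out.print(fourLetter(host, 2181, "stat"));  // or "ZooKeeperServer not running"
    }
}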
I still have the log from one server at debug level and the rest at
warn level. If you would like to see and analyze any of these, just
let me know.
Thanks for the help,
Nick Bailey
On Jan 12, 2010, at 8:20 PM, Patrick Hunt <phunt@apache.org> wrote:
> Nick Bailey wrote:
>> In my last email I failed to include a log line that may be
>> relevant as well:
>> 2010-01-12 18:33:10,658 [QuorumPeer:/0.0.0.0:2181]
>> (QuorumCnxManager) DEBUG - Queue size: 0
>> 2010-01-12 18:33:10,659 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) INFO - Notification time out: 6400
>
> Yes, that is significant/interesting. I believe this means that
> there is some problem with the election process (i.e. the server
> re-joining the ensemble). We have a backoff on these attempts, which
> matches your description below. We have fixed some election issues
> in recent versions (we introduced fault injection testing prior to
> the 3.2.1 release, which found a few issues with election). I don't
> have them offhand, but I've asked Flavio to comment directly (he's
> in a different time zone).
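>
> To illustrate the backoff (this is only a rough sketch, not the
> actual election code), each failed round roughly doubles the
> notification timeout up to a 60000ms cap, which lines up with the
> 6400 you posted and the climb toward 60000 you describe. The
> leaderFound(), sendNotifications() and waitForNotifications() below
> are just placeholders:
>
> // Rough sketch of exponential backoff between election rounds (placeholders, not real code).
> public class ElectionBackoffSketch {
>     static boolean leaderFound() { return false; }           // placeholder
>     static void sendNotifications() { }                       // placeholder
>     static void waitForNotifications(long ms) throws InterruptedException {
>         Thread.sleep(ms);                                      // stand-in for waiting on replies
>     }
>
>     public static void main(String[] args) throws InterruptedException {
>         long timeoutMs = 200;                                  // initial wait between rounds
>         final long maxTimeoutMs = 60000;                       // the cap you are hitting
>         while (!leaderFound()) {
>             sendNotifications();
>             waitForNotifications(timeoutMs);
>             timeoutMs = Math.min(timeoutMs * 2, maxTimeoutMs); // double each round, capped
>             System.out.println("Notification time out: " + timeoutMs);
>         }
>     }
> }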
>
> Can you provide a bit more background: prior to this issue, this
> particular server was running fine? You restarted it and then
> started seeing the issue (rather than this being a new server, I
> mean)? What I'm getting at is that there shouldn't/couldn't be any
> networking/firewall-type issue going on, right?
>
> Can you provide a full/more complete log? What I'd suggest is: shut
> down this one server, clear the log4j log file, then restart it. Let
> the problem reproduce, then gzip the log4j log file and attach it to
> your response. OK?
>
> Patrick
>
>> We see this line occur frequently, and the timeout will gradually
>> increase to 60000. It appears that all of our servers that seem to
>> be acting normally are experiencing the CPU issue I mentioned
>> earlier (https://issues.apache.org/jira/browse/ZOOKEEPER-427).
>> Perhaps that is causing the timeout in responding?
>> Also, to answer your other questions, Patrick: we aren't really
>> storing a large amount of data, and network latency appears fine.
>> Thanks for the help,
>> Nick
>> -----Original Message-----
>> From: "Nick Bailey" <nicholas.bailey@rackspace.com>
>> Sent: Tuesday, January 12, 2010 6:03pm
>> To: zookeeper-user@hadoop.apache.org
>> Subject: Re: Killing a zookeeper server
>> 12 was just to keep uniformity on our servers. Our clients are
>> connecting from the same 12 servers. That's easily modifiable, and
>> perhaps we should look into changing it.
>> The logs just seem to indicate that the servers that claim to have
>> no server running are continually attempting to elect a leader. A
>> sample is provided below. The initial exception is something we
>> see regularly in our logs, and the debug and info lines that follow
>> simply repeat throughout the log.
>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn)
>> WARN - Exception causing close of session 0x0 due to
>> java.io.IOException: Read error
>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn)
>> DEBUG - IOException stack trace
>> java.io.IOException: Read error
>> at org.apache.zookeeper.server.NIOServerCnxn.doIO
>> (NIOServerCnxn.java:295)
>> at org.apache.zookeeper.server.NIOServerCnxn$Factory.run
>> (NIOServerCnxn.java:162)
>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn)
>> INFO - closing session:0x0 NIOServerCnxn:
>> java.nio.channels.SocketChannel[connected local=/172.20.36.9:2181
>> remote=/172.20.36.9:50367]
>> 2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn)
>> DEBUG - ignoring exception during input shutdown
>> java.net.SocketException: Transport endpoint is not connected
>> at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
>> at sun.nio.ch.SocketChannelImpl.shutdownInput
>> (SocketChannelImpl.java:640)
>> at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:
>> 360)
>> at org.apache.zookeeper.server.NIOServerCnxn.close
>> (NIOServerCnxn.java:767)
>> at org.apache.zookeeper.server.NIOServerCnxn.doIO
>> (NIOServerCnxn.java:421)
>> at org.apache.zookeeper.server.NIOServerCnxn$Factory.run
>> (NIOServerCnxn.java:162)
>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) INFO - Adding vote
>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9,
>> LOOKING, LOOKING, 9
>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid:
>> 30064826826, proposed zxid: 30064826826
>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) INFO - Adding vote
>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9,
>> LOOKING, LOOKING, 11
>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid:
>> 30064826826, proposed zxid: 30064826826
>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) INFO - Adding vote
>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9,
>> LOOKING, LOOKING, 12
>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid:
>> 30064826826, proposed zxid: 30064826826
>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) INFO - Adding vote
>> 2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection
>> $Messenger$WorkerReceiver) DEBUG - Receive new message.
>> 2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection
>> $Messenger$WorkerReceiver) DEBUG - Receive new message.
>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9,
>> LOOKING, LOOKING, 8
>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid:
>> 30064826826, proposed zxid: 30064826826
>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) INFO - Adding vote
>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9,
>> LOOKING, LOOKING, 12
>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid:
>> 30064826826, proposed zxid: 30064826826
>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181]
>> (FastLeaderElection) INFO - Adding vote
>> 2010-01-12 17:57:52,200 [WorkerReceiver Thread] (FastLeaderElection
>> $Messenger$WorkerReceiver) DEBUG - Receive new message.
>> 2010-01-12 17:57:52,201 [WorkerReceiver Thread] (FastLeaderElection
>> $Messenger$WorkerReceiver) DEBUG - Receive new message.
>> -----Original Message-----
>> From: "Patrick Hunt" <phunt@apache.org>
>> Sent: Tuesday, January 12, 2010 5:40pm
>> To: zookeeper-user@hadoop.apache.org, nicholas.bailey@rackspace.com
>> Subject: Re: Killing a zookeeper server
>> 12 servers? That's a lot; if you don't mind my asking, why so many?
>> Typically we recommend 5: that way you can have one down for
>> maintenance and still tolerate a failure without bringing down the
>> cluster.
>> The "electing a leader" is probably the restarted machine
>> attempting to re-join the ensemble (it should join as a follower if
>> you have a leader already elected, given that it's xid is behind
>> the existing leader.) Hard to tell though without the logs.
>> You might also be seeing initLimit exceeded. Is the data you are
>> storing in ZK large? Or perhaps network connectivity is slow?
>> http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
>> Again, the logs would give some insight into this.
>> Patrick
>> Nick Bailey wrote:
>>> We are running zookeeper 3.1.0
>>>
>>> Recently we noticed the CPU usage on our machines becoming
>>> increasingly high, and we believe the cause is
>>>
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-427
>>>
>>> However, our solution when we noticed the problem was to kill the
>>> zookeeper process and restart it.
>>>
>>> After doing that, though, it looks like the newly restarted
>>> zookeeper server is continually attempting to elect a leader even
>>> though one already exists.
>>>
>>> The process responds with 'imok' when asked, but the stat command
>>> returns 'ZooKeeperServer not running'.
>>>
>>> I believe that killing the current leader should trigger all
>>> servers to do an election and solve the problem, but I'm not sure.
>>> Should that be the course of action in this situation?
>>>
>>> Also, we have 12 servers, but 5 are currently not running
>>> according to 'stat', so I guess this isn't a problem unless we
>>> lose another one. We have plans to upgrade zookeeper to solve the
>>> CPU issue but haven't been able to do that yet.
>>>
>>> Any help appreciated, Nick Bailey
>>>