hadoop-zookeeper-user mailing list archives

From "Nick Bailey" <ni...@mailtrust.com>
Subject Re: Killing a zookeeper server
Date Wed, 13 Jan 2010 22:19:48 GMT
Additionally, in this case the stat command simply stated that the zookeeper server was not
running, which is obviously in conflict with an 'imok' response.  In order to determine that
the server was attempting to rejoin/elect a leader I needed to enable debug logging, restart
the process, and check the log.

So from a user perspective some more info from the stat command in this case would be very
helpful.
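
For reference, here is roughly the kind of probe involved - a minimal sketch
(not our actual tooling; the host, port, and class name are placeholders) that
just writes a four-letter command such as 'ruok' or 'stat' to a server's client
port and prints whatever comes back:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;

// Hypothetical helper: send a ZooKeeper four-letter command to a server's
// client port and print the raw reply. The defaults below are placeholders.
public class FourLetterWordProbe {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
        String cmd = args.length > 2 ? args[2] : "stat";  // e.g. "ruok" or "stat"

        try (Socket sock = new Socket(host, port)) {
            OutputStream out = sock.getOutputStream();
            out.write(cmd.getBytes("US-ASCII"));          // the command is plain ASCII text
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(sock.getInputStream(), "US-ASCII"));
            // The server writes its answer and then closes the connection.
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);
            }
        }
    }
}

Run side by side against one of the affected servers, it shows that 'ruok'
still answers 'imok' while 'stat' only reports 'ZooKeeperServer not running' -
exactly the mismatch described above.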

-Nick


On Wednesday, January 13, 2010 4:39pm, "Mahadev Konar" <mahadev@yahoo-inc.com> said:

> Hi Adam,
>  The 'ruok' command at the server does not have many checks in place (in fact
> it does not have any checks in place). It just gives you an idea that the
> server is really running. The 'stat' command is meant to analyze the
> server's state.
> 
> 
> So you should in fact file an improvement wherein there are at least some
> checks in place for the 'ruok' command to return 'imok', so that users know
> their servers are running and healthy.
> 
> 
> Thanks
> mahadev
> 
> 
> On 1/13/10 1:29 PM, "Adam Rosien" <adam@rosien.net> wrote:
> 
>> Yes, leader election either failed or was in progress yet 'ruok'
>> returned 'imok'.
>>
>> .. Adam
>>
>> On Wed, Jan 13, 2010 at 1:08 PM, Mahadev Konar <mahadev@yahoo-inc.com> wrote:
>>> Hi Adam,
>>>  That seems fair to file as an improvement. Running 'stat' did return the
>>> right stats, right? Saying the servers weren't able to elect a leader?
>>>
>>> mahadev
>>>
>>>
>>> On 1/13/10 11:52 AM, "Adam Rosien" <adam@rosien.net> wrote:
>>>
>>>> On a related note, it was initially confusing to me that the server
>>>> returned 'imok' when it wasn't part of the quorum. I realize the
>>>> internal checks are probably in separate areas of the code, but if
>>>> others feel similarly I could file an improvement in JIRA.
>>>>
>>>> .. Adam
>>>>
>>>> On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey <nickb@mailtrust.com> wrote:
>>>>> So the solution for us was to just nuke zookeeper and restart everywhere.
>>>>>  We will also be upgrading soon.
>>>>>
>>>>> To answer your question, yes I believe all the servers were running
>>>>> normally except for the fact that they were experiencing high CPU usage.
>>>>> As we began to see some CPU alerts I started restarting some of the servers.
>>>>>
>>>>> It was then that we noticed that they were not actually running
>>>>> according to 'stat'.
>>>>>
>>>>> I still have the log from one server with a debug level and the rest
>>>>> with a warn level. If you would like to see any of these and analyze
>>>>> them just let me know.
>>>>>
>>>>> Thanks for the help,
>>>>> Nick Bailey
>>>>>
>>>>> On Jan 12, 2010, at 8:20 PM, Patrick Hunt <phunt@apache.org> wrote:
>>>>>
>>>>>> Nick Bailey wrote:
>>>>>>>
>>>>>>> In my last email I failed to include a log line that may be relevant
>>>>>>> as well:
>>>>>>> 2010-01-12 18:33:10,658 [QuorumPeer:/0.0.0.0:2181] (QuorumCnxManager) DEBUG - Queue size: 0
>>>>>>> 2010-01-12 18:33:10,659 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification time out: 6400
>>>>>>
>>>>>> Yes, that is significant/interesting. I believe this means that there is
>>>>>> some problem with the election process (i.e. the server re-joining the
>>>>>> ensemble). We have a backoff on these attempts, which matches your
>>>>>> description below. We have fixed some election issues in recent versions
>>>>>> (we introduced fault injection testing prior to the 3.2.1 release, which
>>>>>> found a few issues with election). I don't have them offhand - but I've
>>>>>> asked Flavio to comment directly (he's in a different time zone).
>>>>>>
>>>>>> Can you provide a bit more background: prior to this issue, this
>>>>>> particular server was running fine? You restarted it and then started
>>>>>> seeing the issue? (rather than this being a new server, I mean). What I'm
>>>>>> getting at is that there shouldn't/couldn't be any networking/firewall
>>>>>> type issue going on, right?
>>>>>>
>>>>>> Can you provide a full/more log? What I'd suggest is shut down this one
>>>>>> server, clear the log4j log file, then restart it. Let the problem
>>>>>> reproduce, then gzip the log4j log file and attach it to your response. Ok?
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>>> We see this line occur frequently and the timeout will gradually
>>>>>>> increase to 60000. It appears that all of our servers that seem to be
>>>>>>> acting normally are experiencing the cpu issue I mentioned earlier
>>>>>>> 'https://issues.apache.org/jira/browse/ZOOKEEPER-427'. Perhaps that is
>>>>>>> causing the timeout in responding?
>>>>>>> Also, to answer your other questions Patrick, we aren't storing a large
>>>>>>> amount of data really and network latency appears fine.
>>>>>>> Thanks for the help,
>>>>>>> Nick
>>>>>>> -----Original Message-----
>>>>>>> From: "Nick Bailey" <nicholas.bailey@rackspace.com>
>>>>>>> Sent: Tuesday, January 12, 2010 6:03pm
>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>> Subject: Re: Killing a zookeeper server
>>>>>>> 12 was just to keep uniformity on our servers. Our clients are connecting
>>>>>>> from the same 12 servers.  Easily modifiable and perhaps we should look
>>>>>>> into changing that.
>>>>>>> The logs just seem to indicate that the servers that claim to have no
>>>>>>> server running are continually attempting to elect a leader.  A sample is
>>>>>>> provided below.  The initial exception is something we see regularly in
>>>>>>> our logs and the debug and info lines following are simply repeating
>>>>>>> throughout the log.
>>>>>>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) WARN  - Exception causing close of session 0x0 due to java.io.IOException: Read error
>>>>>>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - IOException stack trace
>>>>>>> java.io.IOException: Read error
>>>>>>>       at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
>>>>>>>       at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
>>>>>>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) INFO  - closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/172.20.36.9:2181 remote=/172.20.36.9:50367]
>>>>>>> 2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - ignoring exception during input shutdown
>>>>>>> java.net.SocketException: Transport endpoint is not connected
>>>>>>>       at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
>>>>>>>       at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
>>>>>>>       at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>>>>>>>       at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:767)
>>>>>>>       at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:421)
>>>>>>>       at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
>>>>>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding vote
>>>>>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 9
>>>>>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>>>>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding vote
>>>>>>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 11
>>>>>>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>>>>>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding vote
>>>>>>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
>>>>>>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>>>>>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding vote
>>>>>>> 2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>>>>>> 2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>>>>>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 8
>>>>>>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>>>>>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding vote
>>>>>>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
>>>>>>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>>>>>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - Adding vote
>>>>>>> 2010-01-12 17:57:52,200 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>>>>>> 2010-01-12 17:57:52,201 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>>>>>> -----Original Message-----
>>>>>>> From: "Patrick Hunt" <phunt@apache.org>
>>>>>>> Sent: Tuesday, January 12, 2010 5:40pm
>>>>>>> To: zookeeper-user@hadoop.apache.org, nicholas.bailey@rackspace.com
>>>>>>> Subject: Re: Killing a zookeeper server
>>>>>>> 12 servers? That's a lot; if you don't mind my asking, why so many?
>>>>>>> Typically we recommend 5 - that way you can have one down for maintenance
>>>>>>> and still have a failure that doesn't bring down the cluster.
>>>>>>> The "electing a leader" is probably the restarted machine attempting to
>>>>>>> re-join the ensemble (it should join as a follower if you have a leader
>>>>>>> already elected, given that its xid is behind the existing leader.) Hard
>>>>>>> to tell though without the logs.
>>>>>>> You might also be seeing the initLimit exceeded. Is the data you are
>>>>>>> storing in ZK large? Or perhaps network connectivity is slow?
>>>>>>>
>>>>>>> http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
>>>>>>> Again, the logs would give some insight on this.
>>>>>>> Patrick
>>>>>>> Nick Bailey wrote:
>>>>>>>>
>>>>>>>> We are running zookeeper 3.1.0
>>>>>>>>
>>>>>>>> Recently we noticed the cpu usage on our machines becoming
>>>>>>>> increasingly high and we believe the cause is
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-427
>>>>>>>>
>>>>>>>> However our solution when we noticed the problem was to kill the
>>>>>>>> zookeeper process and restart it.
>>>>>>>>
>>>>>>>> After doing that though it looks like the newly restarted zookeeper
>>>>>>>> server is continually attempting to elect a leader even though one
>>>>>>>> already exists.
>>>>>>>>
>>>>>>>> The process responds with 'imok' when asked, but the stat command
>>>>>>>> returns 'ZooKeeperServer not running'.
>>>>>>>>
>>>>>>>> I believe that killing the current leader should trigger all servers
>>>>>>>> to do an election and solve the problem, but I'm not sure.  Should
>>>>>>>> that be the course of action in this situation?
>>>>>>>>
>>>>>>>> Also we have 12 servers, but 5 are currently not running according to
>>>>>>>> stat.  So I guess this isn't a problem unless we lose another one.
>>>>>>>> We have plans to upgrade zookeeper to solve the cpu issue but haven't
>>>>>>>> been able to do that yet.
>>>>>>>>
>>>>>>>> Any help appreciated, Nick Bailey
>>>>>>>>
>>>>>
>>>
>>>
> 
> 


