hadoop-zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: Killing a zookeeper server
Date Mon, 25 Jan 2010 20:32:38 GMT
According to the log for 222, it can't open a connection to the election 
port (3888) on any of the other servers. This seems very unusual. Can 
you verify that there's connectivity on that port between 222 and all the 
other servers?
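
One way to run that check from 222 is a small script like the sketch below. It is only an illustration, not part of ZooKeeper: the function names are made up here, and the host names you pass in are whatever appears in your own server list. It attempts a plain TCP connect to port 3888 on each host and reports the error, so "Connection refused" (nothing listening) can be told apart from a timeout (traffic possibly filtered).

```python
import socket
import sys


def check_port(host, port=3888, timeout=3.0):
    """Try a plain TCP connect; return (True, None) or (False, reason)."""
    try:
        # create_connection resolves the host and connects, raising OSError
        # (e.g. ConnectionRefusedError, socket.timeout) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        return False, str(exc)


def check_all(hosts, port=3888, timeout=3.0):
    """Check every host and return {host: (ok, reason)}."""
    return {host: check_port(host, port, timeout) for host in hosts}


if __name__ == "__main__":
    # Hosts come from the command line; nothing is assumed about their names.
    for host, (ok, reason) in check_all(sys.argv[1:]).items():
        status = "reachable" if ok else "FAILED: " + reason
        print(f"{host}:3888 {status}")
```

Saved under any file name and run on 222 with the other four servers' host names as arguments, it prints one line per server. A successful connect only shows that something accepts connections on that port, not that the election itself is healthy.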

Also, can you re-run netstat with the -a option? That way we can see the 
listening sockets (netstat omits them by default). It would be great if 
you could send the netstat output for all 5 servers.

Thanks,

Patrick

Jean-Daniel Cryans wrote:
> Everything is here http://people.apache.org/~jdcryans/zk_election_bug.tar.gz
> 
> The server we are trying to start is sv4borg222 (myid is 2) and we
> started it around 10:03:21
> 
> Thx!
> 
> J-D
> 
> On Mon, Jan 25, 2010 at 10:49 AM, Patrick Hunt <phunt@apache.org> wrote:
>> 1) Capture the logs from all 5 servers
>> 2) Give the config for the "down" server, and also indicate what its server
>> id is.
>> 3) if possible it would be interesting to see the netstat information from 2
>> of the servers - the one that's down and one or more of the others.
>>
>> Patrick
>>
>> Jean-Daniel Cryans wrote:
>>> I believe we've just hit the same problem with zk-3.2.1
>>>
>>> For some reason a machine crashed, and it was part of our quorum of 5
>>> servers. When we try to restart it, it does this (I replaced the
>>> hostname and IP):
>>>
>>> 2010-01-25 10:25:06,469 WARN
>>> org.apache.zookeeper.server.quorum.QuorumCnxManager: Cannot open
>>> channel to 1 at election address somehost1/someip1:3888
>>> java.net.ConnectException: Connection refused
>>>        at sun.nio.ch.Net.connect(Native Method)
>>>        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
>>>        at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
>>>        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
>>>        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:356)
>>>        at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:603)
>>>        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)
>>>
>>> It has been like that for almost 20 minutes now, trying every other
>>> server in the quorum on different channels. ruok says imok but all
>>> other commands say that ZK server isn't running. I don't believe that
>>> 3.2.2 will help unless ZK-547 does more than it seems to.
>>>
>>> Anything else I should look at?
>>>
>>> Thx!
>>>
>>> J-D
>>>
>>> On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey <nickb@mailtrust.com> wrote:
>>>> So the solution for us was to just nuke zookeeper and restart everywhere.
>>>> We will also be upgrading soon.
>>>>
>>>> To answer your question, yes I believe all the servers were running
>>>> normally except for the fact that they were experiencing high CPU
>>>> usage.  As we began to see some CPU alerts I started restarting some
>>>> of the servers.
>>>>
>>>> It was then that we noticed that they were not actually running
>>>> according to 'stat'.
>>>>
>>>> I still have the log from one server with a debug level and the rest
>>>> with a warn level. If you would like to see any of these and analyze
>>>> them just let me know.
>>>>
>>>> Thanks for the help,
>>>> Nick Bailey
>>>>
