1) Capture the logs from all 5 servers
2) Give the config for the "down" server, and also indicate what its
server id is.
3) If possible, it would be interesting to see the netstat information
from at least 2 of the servers: the one that's down and one or more of
the others.
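
To go with the netstat output, a rough sketch like the one below could record what each server answers to the "ruok" and "stat" four-letter commands and whether its election port accepts connections. It is only an illustration: the client port 2181 is an assumption (the default; use whatever clientPort is in your config), 3888 is the election port from the log in this thread, and the function names are made up for this example.

```python
import socket


def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a four-letter command (e.g. "ruok", "stat") and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            # The server closes the connection once it has answered.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def survey(servers, client_port=2181, election_port=3888):
    """Collect four-letter replies and election port reachability per server.

    client_port=2181 is an assumed default; election_port=3888 matches the
    election address shown in the log in this thread.
    """
    report = {}
    for host in servers:
        entry = {"election_port_open": port_open(host, election_port)}
        for cmd in ("ruok", "stat"):
            try:
                entry[cmd] = four_letter_word(host, client_port, cmd)
            except OSError as exc:
                entry[cmd] = "error: %s" % exc
        report[host] = entry
    return report
```

Calling survey() with the list of your 5 hostnames and printing the result for each host gives a quick side-by-side view of which servers answer "imok" but fail "stat", and which election ports refuse connections.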
Patrick
Jean-Daniel Cryans wrote:
> I believe we've just hit the same problem with zk-3.2.1
>
> For some reason a machine crashed and it was part of our quorum of 5
> servers. When we try to restart it, it does this (I replaced the
> hostname and IP):
>
> 2010-01-25 10:25:06,469 WARN
> org.apache.zookeeper.server.quorum.QuorumCnxManager: Cannot open
> channel to 1 at election address somehost1/someip1:3888
> java.net.ConnectException: Connection refused
> at sun.nio.ch.Net.connect(Native Method)
> at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
> at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
> at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:356)
> at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:603)
> at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)
>
> It has been like that for almost 20 minutes now, trying every other
> server in the quorum on different channels. ruok says imok but all
> other commands say that ZK server isn't running. I don't believe that
> 3.2.2 will help unless ZK-547 does more than it seems to.
>
> Anything else I should look at?
>
> Thx!
>
> J-D
>
> On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey <nickb@mailtrust.com> wrote:
>> So the solution for us was to just nuke zookeeper and restart everywhere.
>> We will also be upgrading soon.
>>
>> To answer your question, yes I believe all the servers were running normally
>> except for the fact that they were experiencing high CPU usage. As we began
>> to see some CPU alerts I started restarting some of the servers.
>>
>> It was then that we noticed that they were not actually running according to
>> 'stat'.
>>
>> I still have the log from one server with a debug level and the rest with a
>> warn level. If you would like to see any of these and analyze them just let
>> me know.
>>
>> Thanks for the help,
>> Nick Bailey
>>