zookeeper-user mailing list archives

From Ian Spence <Ian.Spe...@globalrelay.net>
Subject Re: [**SPAM**] Re: [**SPAM**] RE: ZK Server does not join quorum after restart
Date Fri, 25 Jan 2019 18:19:26 GMT
Hi Andor,

As this is a production server, I can’t attach the entire log file, but I’ll try to
get you as much information as I can:

Nearly all of the log file is filled with connection errors from ZooKeeper clients:

> WARN NIOServerCnxn – Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> INFO NIOServerCnxn – Closed socket connection for client /<redacted> (no session established for client)

I grabbed all of the IP addresses in the log file and they’re all from clients, no mention
of other ZK servers.
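
For anyone repeating that check, here is a minimal Python sketch of tallying client addresses from such log lines. It assumes the unredacted lines carry the peer as `/IPv4:port` right after the word `client`; the function name and the regular expression are illustrative, not part of ZooKeeper.

```python
import re
from collections import Counter

# Assumption: the peer appears as "/<IPv4>:<port>" after "client ".
CLIENT_RE = re.compile(r"client /(\d{1,3}(?:\.\d{1,3}){3}):\d+")


def count_client_ips(lines):
    """Count how often each client IPv4 address appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = CLIENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Comparing the resulting addresses against the server list in zoo.cfg shows whether any of them belong to other ensemble members.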

Looking at ‘Quorum’, I see a lot of:

> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181] INFO  FastLeaderElection - Notification time out: 60000
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181] INFO  QuorumCnxManager - Have smaller server identifier, so dropping the connection: (2, 1)
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181] INFO  QuorumCnxManager - Have smaller server identifier, so dropping the connection: (3, 1)
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181] INFO  QuorumCnxManager - Have smaller server identifier, so dropping the connection: (4, 1)
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181] INFO  QuorumCnxManager - Have smaller server identifier, so dropping the connection: (5, 1)
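
For context on those QuorumCnxManager lines, here is a minimal sketch of the tie-break rule as the 3.4.x source appears to implement it: of the two election connections a pair of servers may open to each other, only the one opened by the server with the larger id is kept. The function names are illustrative and are not ZooKeeper APIs.

```python
def keeps_outgoing_connection(my_id: int, remote_id: int) -> bool:
    """Return True if a server keeps an election connection it opened itself.

    Assumed rule: the server with the larger id wins the tie-break, so a
    connection opened towards a larger id is dropped by its initiator.
    """
    return my_id > remote_id


def dropped_peers(my_id, peer_ids):
    """List the peers towards which my_id would drop its own outgoing connection."""
    return [sid for sid in peer_ids if not keeps_outgoing_connection(my_id, sid)]
```

Under this rule `dropped_peers(1, [2, 3, 4, 5])` returns all four peers, which matches the pairs logged above. If the rule holds, these INFO lines are expected for the server with the lowest id, and server 1 depends on servers 2 to 5 connecting back to it.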

Let me know if there is anything else you think I should look for. If I find anything interesting
I’ll share it here.



From: Andor Molnar <andor@cloudera.com.INVALID>
Reply-To: "user@zookeeper.apache.org" <user@zookeeper.apache.org>
Date: Friday, January 25, 2019 at 10:01
To: "user@zookeeper.apache.org" <user@zookeeper.apache.org>
Subject: [**SPAM**] Re: [**SPAM**] RE: ZK Server does not join quorum after restart

Hi Ian,

Would you please attach logs from all participants of the ensemble or try
to find an exception from when the follower is trying to join?

Regards,
Andor



On Fri, Jan 25, 2019 at 1:37 AM Ian Spence <Ian.Spence@globalrelay.net> wrote:

Hi Daniel,

Thanks for the quick reply. We use static IP addresses on all of the
servers so it did not change after the reboot.

Thanks,
-Ian

From: Daniel Chan <daniel.cw.chan@oracle.com>
Reply-To: "user@zookeeper.apache.org" <user@zookeeper.apache.org>
Date: Thursday, January 24, 2019 at 16:36
To: "user@zookeeper.apache.org" <user@zookeeper.apache.org>
Subject: [**SPAM**] RE: ZK Server does not join quorum after restart


If its IP address changed, then you have hit a known bug,
https://issues.apache.org/jira/browse/ZOOKEEPER-1506, and you need to
bounce the cluster.

Thanks,
Daniel

-----Original Message-----
From: Ian Spence <Ian.Spence@globalrelay.net>
Sent: Thursday, January 24, 2019 2:36 PM
To: user@zookeeper.apache.org
Subject: ZK Server does not join quorum after restart

Hello

We have a cluster of 5 ZK servers, all running ZK 3.4.6 on Java 1.8 on
CentOS 6. These are physical devices, not virtual machines.

One server required hardware maintenance, and was restarted. When the zk
software was restarted, it did not rejoin the quorum as a follower.

Running “stat” or “mntr” commands returns: “This ZooKeeper instance is not
currently serving requests”
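
A minimal Python sketch of running such a four-letter command from a script, assuming the server accepts the command as plain ASCII on its client port and closes the connection after replying. The helper names are hypothetical, and any host and port passed in are placeholders.

```python
import socket


def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a four-letter command (e.g. "stat" or "mntr") and return the reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def is_serving(response):
    """Return False when the reply is the "not currently serving requests" message."""
    return "not currently serving requests" not in response
```

Calling `is_serving(four_letter_word(host, 2181, "stat"))` against each member of the ensemble gives a quick view of which servers have joined the quorum.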

I googled this message and came across this bug:
https://issues.apache.org/jira/browse/ZOOKEEPER-2164

Does anybody know if there is a workaround for this issue? We’ve seen this
problem multiple times in the past and our current solution is to bring
down the zk cluster (which is a huge outage-causing pain).

Thanks

- Ian


