Date: Sun, 16 Jun 2019 19:33:00 +0000 (UTC)
From: "Sebastian Schmitz (JIRA)"
To: issues@zookeeper.apache.org
Reply-To: dev@zookeeper.apache.org
Subject: [jira] [Commented] (ZOOKEEPER-2938) Server is unable to join quorum after connection broken to other peers

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865134#comment-16865134 ]

Sebastian Schmitz commented on ZOOKEEPER-2938:
----------------------------------------------

I deployed 3.5.5 in our Test environment after it had been running fine in Dev, but there it went crazy. I was not able to update Kafka without hitting the mentioned errors and manually restarting some ZooKeeper leaders. So this Sunday I changed back to 3.4.14, which is running fine without the problem.

It's a bit strange that it runs fine in Dev, which is also a 3-node Kafka cluster with 3 ZooKeepers. Test just adds some mirroring to another site, with a 3-node cluster in each location, but as this is more related to Kafka it shouldn't have any impact on the ZooKeepers. The errors happen in both sites for the ZooKeepers during an update of the cluster, which re-deploys the Docker images. Of course, all the images are from the same repo with the same tags.

Just tell me what you need and I can send log files, configuration, etc.

> Server is unable to join quorum after connection broken to other peers
> ----------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2938
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2938
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.6
>            Reporter: Abhay Bothra
>            Priority: Major
>
> We see the following logs in the node with {{myid: 1}}
> {code}
> 2017-11-08 15:06:28,375 [myid:1] - INFO [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (2, 1)
> 2017-11-08 15:06:28,375 [myid:1] - INFO [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (3, 1)
> 2017-11-08 15:07:28,375 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 1 (n.leader), 0x28e000a8750 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0x28e (n.peerEpoch) LOOKING (my state)
> 2017-11-08 15:07:28,375 [myid:1] - INFO [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (2, 1)
> 2017-11-08 15:07:28,376 [myid:1] - INFO [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (3, 1)
> 2017-11-08 15:08:28,375 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 1 (n.leader), 0x28e000a8750 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0x28e (n.peerEpoch) LOOKING (my state)
> 2017-11-08 15:08:28,376 [myid:1] - INFO [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (2, 1)
> 2017-11-08 15:08:28,376 [myid:1] - INFO [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (3, 1)
> 2017-11-08 15:09:28,376 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 1 (n.leader), 0x28e000a8750 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0x28e (n.peerEpoch) LOOKING (my state)
> 2017-11-08 15:09:28,376 [myid:1] - INFO [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (2, 1)
> 2017-11-08 15:09:28,376 [myid:1] - INFO [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (3, 1)
> 2017-11-08 15:10:28,376 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 1 (n.leader), 0x28e000a8750 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0x28e (n.peerEpoch) LOOKING (my state)
> 2017-11-08 15:10:28,376 [myid:1] - INFO [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (2, 1)
> 2017-11-08 15:10:28,377 [myid:1] - INFO [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (3, 1)
> {code}
> On the nodes with {{myid: 2}} and {{myid: 3}}, we see connection broken events for {{myid: 1}}
> {code}
> 2017-11-07 02:54:32,135 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@780] - Connection broken for id 1, my id = 2, error =
> java.net.SocketException: Connection reset
>     at java.net.SocketInputStream.read(SocketInputStream.java:209)
>     at java.net.SocketInputStream.read(SocketInputStream.java:141)
>     at java.net.SocketInputStream.read(SocketInputStream.java:223)
>     at java.io.DataInputStream.readInt(DataInputStream.java:387)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:765)
> 2017-11-07 02:54:32,135 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@783] - Interrupting SendWorker
> 2017-11-07 02:54:32,135 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@697] - Interrupted while waiting for message on queue
> java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>     at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:849)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:64)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:685)
> 2017-11-07 02:54:32,135 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@706] - Send worker leaving thread
> {code}
> From the reported occurrences, it looks like this is a problem only when the node with the smallest {{myid}} loses connection.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
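
The repeated "Have smaller server identifier, so dropping the connection" lines in the first log are consistent with a tie-break on server ids, in which each pair of peers settles on a single election connection opened by the peer with the larger id. The sketch below is a simplified model of that rule as suggested by the log message, not the actual QuorumCnxManager code; the class, enum, and method names are hypothetical and chosen only for illustration.

```java
public class QuorumTieBreakSketch {

    /** What a peer does with a freshly opened election connection (hypothetical names). */
    enum Action { KEEP, DROP_AND_WAIT_FOR_PEER, DROP_AND_DIAL_BACK }

    /** Outgoing connection: the side with the smaller id gives the socket up. */
    static Action onOutgoing(long myId, long remoteId) {
        return remoteId > myId ? Action.DROP_AND_WAIT_FOR_PEER : Action.KEEP;
    }

    /** Incoming connection: the side with the larger id closes it and dials back itself. */
    static Action onIncoming(long myId, long remoteId) {
        return remoteId < myId ? Action.DROP_AND_DIAL_BACK : Action.KEEP;
    }

    public static void main(String[] args) {
        long[] ids = {1, 2, 3};
        // Print the decision for every ordered pair of distinct peers.
        for (long me : ids) {
            for (long peer : ids) {
                if (me == peer) {
                    continue;
                }
                System.out.printf("myid=%d -> %d: outgoing=%s, incoming=%s%n",
                        me, peer, onOutgoing(me, peer), onIncoming(me, peer));
            }
        }
    }
}
```

In this model the node with myid 1 gives up every connection it opens to 2 and 3, which matches the (2, 1) and (3, 1) pairs in the log, and it depends on those peers dialing back. Whether the dial-back is what fails in the reported case is not established by the logs shown here.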