Date: Mon, 15 May 2017 12:54:05 +0000 (UTC)
From: "Dan (JIRA)"
To: dev@kafka.apache.org
Subject: [jira] [Commented] (KAFKA-3984) Broker doesn't retry reconnecting to an expired Zookeeper connection

    [ https://issues.apache.org/jira/browse/KAFKA-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010436#comment-16010436 ]

Dan commented on KAFKA-3984:
----------------------------

We encountered the same problem in 0.10.1.1. Is there any plan to fix it?

> Broker doesn't retry reconnecting to an expired Zookeeper connection
> --------------------------------------------------------------------
>
>                 Key: KAFKA-3984
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3984
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.9.0.1
>            Reporter: Braedon Vickers
>
> We've been having issues with the network connectivity of our Kafka cluster, and this seems to be triggering an issue where the brokers stop trying to reconnect to Zookeeper, leaving us with a broken cluster even when the network has recovered.
> When network issues begin we see {{java.net.NoRouteToHostException}} exceptions from {{org.apache.zookeeper.ClientCnxn}} as it attempts to re-establish the connection.
> If the network issue resolves itself while we are only getting these errors, the broker seems to reconnect fine.
> However, a lot of the time we end up with a message like this:
> {code}[2016-07-22 00:21:44,181] FATAL Could not establish session with zookeeper (kafka.server.KafkaHealthcheck)
> org.I0Itec.zkclient.exception.ZkException: Unable to connect to
>         at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:71)
>         at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1279)
>         ...
> Caused by: java.net.UnknownHostException:
>         at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>         at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
>         at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
>         ...
> {code}
> (apologies for the partial stack traces - I'm having to try and reconstruct them from a less than ideal centralised logging setup.)
> If this happens, the broker stops trying to reconnect to Zookeeper, and we have to restart it.
> It looks like while the {{org.apache.zookeeper.ZooKeeper}} client's state isn't {{Expired}} it will keep retrying the connection, and will recover OK when the network is back. However, once it changes to {{Expired}} (not entirely sure how that happens - based on the session timeout perhaps?) zkclient closes the existing client and attempts to create a new one. If the network is still down, the client constructor throws a {{java.net.UnknownHostException}} and zkclient calls {{handleSessionEstablishmentError()}} on {{KafkaHealthcheck}}. {{KafkaHealthcheck.handleSessionEstablishmentError()}} then logs a "Fatal" error and does nothing else.
> It seems like some form of retry needs to happen here, or the broker is stuck with no Zookeeper connection indefinitely. {{KafkaHealthcheck.handleSessionEstablishmentError()}} used to kill the JVM, but that was removed in https://issues.apache.org/jira/browse/KAFKA-2405.
> Killing the JVM would be better than doing nothing, as then your init system could restart it, allowing it to recover once the network was back.
> Our cluster is running 0.9.0.1, so not sure if it affects 0.10.0.0 as well. However, it seems likely, as there don't seem to be any code changes in kafka or zkclient that would affect this behaviour.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
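
For illustration only, the kind of retry the report asks for (instead of logging a fatal error and giving up) could be sketched roughly as below. This is not Kafka or zkclient code: the class {{ReconnectSketch}}, the method {{retryWithBackoff}} and all of its parameters are hypothetical names, and the assumption is that whatever creates the new ZooKeeper client can be wrapped in a {{Callable}} that throws (for example {{java.net.UnknownHostException}}) while the network is still down.

```java
import java.net.UnknownHostException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: none of these names exist in Kafka or zkclient.
final class ReconnectSketch {

    private ReconnectSketch() {}

    /**
     * Calls connect until it succeeds or maxAttempts attempts have failed.
     * Sleeps between attempts; the delay doubles each time, capped at maxDelayMs.
     * If every attempt fails, the last failure is rethrown to the caller.
     */
    static <T> T retryWithBackoff(Callable<T> connect, int maxAttempts,
                                  long initialDelayMs, long maxDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delayMs = initialDelayMs;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMs);
                    delayMs = Math.min(delayMs * 2, maxDelayMs);
                }
            }
        }
        throw lastFailure;
    }

    public static void main(String[] args) throws Exception {
        // Simulated connection attempt: fails twice with a DNS error, then succeeds.
        AtomicInteger calls = new AtomicInteger();
        String session = retryWithBackoff(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new UnknownHostException("zookeeper.example.invalid");
            }
            return "session";
        }, 5, 10L, 100L);
        System.out.println(session + " established after " + calls.get() + " attempts");
    }
}
```

If every attempt fails, the caller still has to decide what to do with the rethrown exception. Exiting the JVM at that point, so that the init system can restart the broker, would match the fallback the report describes.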