kafka-dev mailing list archives

From "Braedon Vickers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (KAFKA-3984) Broker doesn't retry reconnecting to an expired Zookeeper connection
Date Fri, 22 Jul 2016 05:41:20 GMT
Braedon Vickers created KAFKA-3984:
--------------------------------------

             Summary: Broker doesn't retry reconnecting to an expired Zookeeper connection
                 Key: KAFKA-3984
                 URL: https://issues.apache.org/jira/browse/KAFKA-3984
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.9.0.1
            Reporter: Braedon Vickers


We've been having issues with the network connectivity of our Kafka cluster, and this seems
to be triggering an issue where the brokers stop trying to reconnect to Zookeeper, leaving
us with a broken cluster even when the network has recovered.

When network issues begin we see {{java.net.NoRouteToHostException}} exceptions from {{org.apache.zookeeper.ClientCnxn}}
as it attempts to re-establish the connection. If the network issue resolves itself while
we are only getting these errors, the broker seems to reconnect fine.

However, a lot of the time we end up with a message like this:
{code}
[2016-07-22 00:21:44,181] FATAL Could not establish session with zookeeper (kafka.server.KafkaHealthcheck)
org.I0Itec.zkclient.exception.ZkException: Unable to connect to <zookeeper hosts>
	at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:71)
	at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1279)
...
Caused by: java.net.UnknownHostException: <zookeeper host>
	at java.net.InetAddress.getAllByName(InetAddress.java:1126)
	at java.net.InetAddress.getAllByName(InetAddress.java:1192)
	at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
	at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
...
{code}
(apologies for the partial stack traces - I'm having to try and reconstruct them from a less
than ideal centralised logging setup.)

If this happens, the broker stops trying to reconnect to Zookeeper, and we have to restart
it.

It looks like while the {{org.apache.zookeeper.ZooKeeper}} client's state isn't {{Expired}}
it will keep retrying the connection, and will recover OK when the network is back. However,
once it changes to {{Expired}} (not entirely sure how that happens - based on the session
timeout perhaps?) zkclient closes the existing client and attempts to create a new one. If
the network is still down, the client constructor throws a {{java.net.UnknownHostException}}
and zkclient calls {{handleSessionEstablishmentError()}} on {{KafkaHealthcheck}}, which
logs a "Fatal" error and does nothing else.
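
As a rough illustration of the flow described above, here is a toy model in plain Java. It
is not the actual zkclient or Kafka code: the class {{ExpiredSessionModel}}, its fields and
its method names are all made up for this sketch, and the only behaviour it encodes is what
the stack trace and description in this report suggest.

```java
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/** Toy model of the reported flow; every name here is illustrative only. */
final class ExpiredSessionModel {
    enum State { DISCONNECTED, EXPIRED }

    boolean networkUp = false;
    boolean hasClient = true;   // stands in for the underlying ZooKeeper client
    final List<String> log = new ArrayList<>();

    /** Stands in for constructing a new client, which resolves host names up front. */
    private void createClient() throws UnknownHostException {
        if (!networkUp) {
            throw new UnknownHostException("<zookeeper host>");
        }
        hasClient = true;
    }

    /** Only EXPIRED triggers close-and-recreate; other states leave the client alone. */
    void onStateChange(State state) {
        if (state != State.EXPIRED) {
            log.add("existing client keeps retrying on its own");
            return;
        }
        hasClient = false;      // the existing client is closed
        try {
            createClient();
            log.add("new session established");
        } catch (UnknownHostException e) {
            // Equivalent of the handler described above: log and do nothing else.
            log.add("FATAL Could not establish session with zookeeper");
        }
    }

    public static void main(String[] args) {
        ExpiredSessionModel model = new ExpiredSessionModel();
        model.onStateChange(State.DISCONNECTED);
        model.onStateChange(State.EXPIRED);
        model.networkUp = true; // the network recovers, but nothing retries
        System.out.println(model.log + " hasClient=" + model.hasClient);
    }
}
```

In this model, once the {{EXPIRED}} branch fails there is no client left and no code path
that would create one, which matches the stuck state we see after the network recovers.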

It seems like some form of retry needs to happen here, or the broker is stuck with no Zookeeper
connection indefinitely. {{KafkaHealthcheck.handleSessionEstablishmentError()}} used to kill
the JVM, but that was removed in https://issues.apache.org/jira/browse/KAFKA-2405. Killing
the JVM would be better than doing nothing, as then your init system could restart it, allowing
it to recover once the network was back.
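
To make the "some form of retry" idea concrete, here is a minimal sketch of a bounded retry
with exponential backoff around session creation. It is only an assumption about what a fix
could look like, not a proposed patch: {{SessionRetry}}, {{connectWithRetry}} and its
parameters are hypothetical, the session is modelled as a generic {{Callable}}, and the 30
second cap is an arbitrary choice. The sketch catches {{UnknownHostException}} directly,
whereas in the trace above it arrives wrapped in a {{ZkException}}.

```java
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

/** Hypothetical helper; not part of Kafka or zkclient. */
final class SessionRetry {
    private SessionRetry() {}

    /**
     * Calls connect until it succeeds or maxAttempts is reached, doubling the
     * delay after each failed attempt (capped at 30 seconds). Rethrows the last
     * UnknownHostException once the attempts are exhausted.
     */
    static <T> T connectWithRetry(Callable<T> connect, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (UnknownHostException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMs);
                    delayMs = Math.min(delayMs * 2, 30_000L);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated connection that fails twice before the "network" recovers.
        String session = connectWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new UnknownHostException("<zookeeper host>");
            }
            return "session-established";
        }, 5, 10L);
        System.out.println(session + " after " + calls[0] + " attempts");
    }
}
```

If the attempts run out, the caller could still fall back to exiting the JVM so that the
init system restarts the broker, which is the behaviour suggested above as a minimum.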



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
