kafka-dev mailing list archives

From "Ismael Juma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-4418) Broker Leadership Election Fails If Missing ZK Path Raises Exception
Date Thu, 17 Nov 2016 15:43:58 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674023#comment-15674023 ]

Ismael Juma commented on KAFKA-4418:
------------------------------------

Thanks for the report. Why was the path missing?

> Broker Leadership Election Fails If Missing ZK Path Raises Exception
> --------------------------------------------------------------------
>
>                 Key: KAFKA-4418
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4418
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Michael Pedersen
>
> Our Kafka cluster went down because a single node went down *and* a path in ZooKeeper
> was missing for one topic (/brokers/topics/<topicname>/partitions). When this occurred,
> leadership election could not run, and produced a stack trace that looked like this:
> Failed to start preferred replica election
> org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/warandpeace/partitions
> 	at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)
> 	at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:995)
> 	at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:675)
> 	at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:671)
> 	at kafka.utils.ZkUtils.getChildren(ZkUtils.scala:537)
> 	at kafka.utils.ZkUtils$$anonfun$getAllPartitions$1.apply(ZkUtils.scala:817)
> 	at kafka.utils.ZkUtils$$anonfun$getAllPartitions$1.apply(ZkUtils.scala:816)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> 	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> 	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> 	at kafka.utils.ZkUtils.getAllPartitions(ZkUtils.scala:816)
> 	at kafka.admin.PreferredReplicaLeaderElectionCommand$.main(PreferredReplicaLeaderElectionCommand.scala:64)
> 	at kafka.admin.PreferredReplicaLeaderElectionCommand.main(PreferredReplicaLeaderElectionCommand.scala)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/warandpeace/partitions
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
> 	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
> 	at org.I0Itec.zkclient.ZkConnection.getChildren(ZkConnection.java:114)
> 	at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:678)
> 	at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:675)
> 	at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:985)
> 	... 16 more
> I have looked through the code a bit and found a place where a quick fix would seem to
> allow the leadership election to continue. Specifically, the function at
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/utils/ZkUtils.scala#L633
> does not handle possible exceptions. Wrapping the call in a try/catch block there would
> work, but could introduce a number of other problems:
> * If the code is used elsewhere, a caller at a higher level might need the exception in
> order to stop some other operation from proceeding.
> * Unless the exception is logged or reported somehow, no one will know this problem exists,
> which makes debugging other problems harder.
> I'm sure there are other issues I'm not aware of, but those two come to mind quickly.
> What would be the best route for getting this resolved quickly?
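
The try/catch idea described in the report can be sketched as follows. This is a minimal
illustration in Java, not Kafka's actual code and not the eventual fix: PartitionLister,
NoNodeException, skippedTopics, and the getChildren function parameter are hypothetical
stand-ins for ZkUtils.getAllPartitions, ZkNoNodeException, and ZkClient.getChildren. The
sketch tries to address the two concerns raised in the report by logging every skipped
topic and by confining the catch to this one method, so other callers of getChildren
would still see the exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Logger;

/** Hypothetical sketch: list partitions while tolerating a missing partitions path. */
final class PartitionLister {

    /** Hypothetical stand-in for the client's "no such node" exception. */
    static class NoNodeException extends RuntimeException {
        NoNodeException(String path) {
            super("NoNode for " + path);
        }
    }

    private static final Logger LOG = Logger.getLogger(PartitionLister.class.getName());

    /** Topics whose partitions path was missing, kept so callers can report them. */
    final List<String> skippedTopics = new ArrayList<>();

    /**
     * Returns "topic-partition" names for every topic whose partitions path exists.
     * The getChildren argument is an assumed abstraction over the ZooKeeper client:
     * it maps a path to its child node names and throws NoNodeException if the path
     * is absent.
     */
    List<String> getAllPartitions(List<String> topics,
                                  Function<String, List<String>> getChildren) {
        List<String> result = new ArrayList<>();
        for (String topic : topics) {
            String path = "/brokers/topics/" + topic + "/partitions";
            try {
                for (String partition : getChildren.apply(path)) {
                    result.add(topic + "-" + partition);
                }
            } catch (NoNodeException e) {
                // Log and record the problem instead of swallowing it silently,
                // then continue with the remaining topics.
                LOG.warning("Skipping topic " + topic + ": " + e.getMessage());
                skippedTopics.add(topic);
            }
        }
        return result;
    }
}
```

Only the missing-node case is caught here; any other failure still propagates to the
caller, which keeps the change narrow. Whether skipping a topic is acceptable for the
election command is a design question the sketch does not settle.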



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
