kafka-dev mailing list archives

From "Onur Karaman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics
Date Wed, 15 Mar 2017 06:27:41 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925611#comment-15925611
] 

Onur Karaman commented on KAFKA-4900:
-------------------------------------

For reference, I think the controller was changing every 3 seconds or so during the re-election
loop.

We were able to stabilize the cluster after about two hours but haven't yet figured out the
root cause.

We didn't know the cause, so we definitely didn't take the optimal sequence of steps to stabilize the cluster. Anyway, here's what we did:
# We figured deleting the /controller znode wouldn't make a difference since it was changing every 3 seconds anyway.
# We initially tried bouncing what we had incorrectly thought was the sole broker corresponding to the metric collision, but this didn't help. The re-election loop continued, and we later found out in the logs that there had been "connection-close-rate" metric collisions tagged with other brokers.
# We then shifted traffic out of the bad cluster, killed all of the brokers, and restarted
them all. Getting the cluster back up and recovering its unflushed segments took around 25
minutes.
# We still saw under-replicated partitions after the cluster was back up and later found that some of the fetchers hadn't been started. I think at this point we just deleted the /controller znode to elect a new controller, in the hope that it would broadcast leadership/followership of partitions to the cluster so that the brokers' fetchers could start up correctly (see the sketch after this list). After this, the under-replicated partitions came back down to approximately zero. We think the few under-replicated partitions here and there are from uneven partition leadership distribution, but this wasn't a big deal so we haven't attempted to fix the few remaining under-replicated partitions yet.
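
As an editorial sketch of step 4's znode deletion (this is not from the original comment): the same org.I0Itec.zkclient library that appears in the stack traces below can delete /controller directly. The connect string is copied from the log lines; the class name and timeouts are illustrative assumptions.
{code}
import org.I0Itec.zkclient.ZkClient;
import org.I0Itec.zkclient.serialize.BytesPushThroughSerializer;

public class ForceControllerReelection {
    public static void main(String[] args) {
        ZkClient zk = new ZkClient("samsa-zkserver.stage.sjc1.square:26101/samsa",
                30000, 30000, new BytesPushThroughSerializer());
        try {
            // /controller is an ephemeral znode; deleting it makes every
            // broker's ZookeeperLeaderElector race to elect a new controller,
            // which then rebroadcasts leadership/followership state.
            zk.delete("/controller");
        } finally {
            zk.close();
        }
    }
}
{code}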

> Brokers stuck in controller re-election loop after failing to register metrics
> ------------------------------------------------------------------------------
>
>                 Key: KAFKA-4900
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4900
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, core
>    Affects Versions: 0.10.1.1
>            Reporter: Nick Travers
>
> We hit this today in one of our three-node staging clusters. The exception continues to occur on all three nodes.
> {code}
> 2017-03-15 02:17:30,677 ERROR [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa]
server.ZookeeperLeaderElector - Error while electing or becoming leader on broker 9
> java.lang.IllegalArgumentException: A metric named 'MetricName [name=connection-close-rate,
group=controller-channel-metrics,description=Connections closed per second in the window.,
tags={broker-id=10}]' already exists, can't register another one.
>         at org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
>         at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
>         at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
>         at org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
>         at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
>         at kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
>         at kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
>         at kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
>         at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
>         at kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
>         at kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
>         at kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
>         at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
>         at kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
>         at kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
>         at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
>         at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
>         at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
>         at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
>         at kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
>         at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
>         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) 
> {code}
> We observe a tight loop of controller (re-)election, i.e. one node hits this exception,
and leadership transitions to the next, which then hits the exception, ad infinitum.
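> A minimal editorial sketch (not code from this issue) of how the exception above arises: the public org.apache.kafka.common.metrics.Metrics registry refuses to register the same MetricName twice, which is what happens when a new ControllerChannelManager is built while the previous instance's per-broker metrics are still registered.
> {code}
> import java.util.Collections;
> import java.util.Map;
>
> import org.apache.kafka.common.MetricName;
> import org.apache.kafka.common.metrics.Measurable;
> import org.apache.kafka.common.metrics.Metrics;
>
> public class MetricCollisionDemo {
>     public static void main(String[] args) {
>         Metrics metrics = new Metrics();
>         Map<String, String> tags = Collections.singletonMap("broker-id", "10");
>         MetricName name = metrics.metricName(
>                 "connection-close-rate", "controller-channel-metrics",
>                 "Connections closed per second in the window.", tags);
>         Measurable zero = (config, now) -> 0.0; // dummy gauge for the demo
>         metrics.addMetric(name, zero); // first registration succeeds
>         metrics.addMetric(name, zero); // IllegalArgumentException:
>                                        // "A metric named ... already exists"
>     }
> }
> {code}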
> Producers and consumers appear to be connecting ok, and are able to produce and consume
messages.
> Relevant data points:
>  - prior to this cluster restart, a partition reassignment was attempted for a number of topics, which appeared to get stuck in the "in progress" state (on the order of days); see the sketch after this list
>  - these topics were subsequently deleted
>  - a rolling restart of the cluster was performed to turn on broker-to-broker SSL communication
>  - the SSL change has subsequently been _rolled back_ after we observed these exceptions
>  - the entire cluster was shut down, and nodes brought back one at a time in an attempt
to clear the exception. We were able to restart the cluster, but we continue to see the exceptions
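> As an editorial aside: in 0.10.x an in-flight reassignment is recorded in the /admin/reassign_partitions znode until it completes, so a long-lived non-empty value there matches the "stuck in progress" observation in the first bullet. A sketch for checking it with the same org.I0Itec.zkclient library seen in the stack traces (host/chroot copied from the log lines; timeouts are illustrative):
> {code}
> import java.nio.charset.StandardCharsets;
>
> import org.I0Itec.zkclient.ZkClient;
> import org.I0Itec.zkclient.serialize.BytesPushThroughSerializer;
>
> public class CheckPendingReassignment {
>     public static void main(String[] args) {
>         // BytesPushThroughSerializer hands back the raw znode bytes; the
>         // default serializer would attempt Java deserialization on JSON.
>         ZkClient zk = new ZkClient("samsa-zkserver.stage.sjc1.square:26101/samsa",
>                 30000, 30000, new BytesPushThroughSerializer());
>         try {
>             byte[] data = zk.readData("/admin/reassign_partitions", true);
>             System.out.println(data == null
>                     ? "no reassignment pending"
>                     : new String(data, StandardCharsets.UTF_8));
>         } finally {
>             zk.close();
>         }
>     }
> }
> {code}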
> We also observed, during the same time as the exception above, the following exception
on all hosts:
> {code}
> 2017-03-15 01:44:04,572 ERROR [ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa]
controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener on Controller
10]: Error while handling broker changes
> java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
>         at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
>         at kafka.controller.KafkaController$$anonfun$8$$anonfun$apply$2.apply(KafkaController.scala:436)
>         at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
>         at scala.collection.immutable.List.exists(List.scala:84)
>         at kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:436)
>         at kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:435)
>         at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>         at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>         at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
>         at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
>         at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
>         at kafka.controller.KafkaController.onBrokerStartup(KafkaController.scala:435)
>         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:374)
>         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358)
>         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358)
>         at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:357)
>         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356)
>         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356)
>         at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
>         at kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:355)
>         at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:843)
>         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {code}
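> The unboxing failure above (BoxesRunTime.unboxToInt hitting a String) is the classic signature of an erased generic cast: a collection that actually holds strings is treated as a collection of ints, and nothing fails until an element is used as an int. A deliberately simplified editorial illustration in Java, where anyMatch plays the role of the List.exists call in the trace:
> {code}
> import java.util.Arrays;
> import java.util.List;
>
> public class UnboxCastDemo {
>     @SuppressWarnings("unchecked")
>     public static void main(String[] args) {
>         // Broker ids that arrived as strings (e.g. parsed from JSON) ...
>         List<?> parsed = Arrays.asList("9", "10");
>         // ... pass through an erased cast with no runtime check here ...
>         List<Integer> brokerIds = (List<Integer>) parsed;
>         // ... and only fail when an element is consumed as an Integer:
>         // java.lang.ClassCastException: java.lang.String cannot be cast
>         // to java.lang.Integer
>         boolean found = brokerIds.stream().anyMatch(id -> id == 10);
>         System.out.println(found);
>     }
> }
> {code}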



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
