Date: Wed, 15 Mar 2017 06:27:41 +0000 (UTC)
From: "Onur Karaman (JIRA)"
To: dev@kafka.apache.org
Subject: [jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics

    [ https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925611#comment-15925611 ]

Onur Karaman commented on KAFKA-4900:
-------------------------------------

For reference, I think the controller was changing every 3 seconds or so during the re-election loop. We were able to stabilize the cluster after about two hours, but we haven't yet figured out the root cause. Since we didn't know the cause, we definitely didn't take the optimal sequence of steps to stabilize the cluster.
Anyway, here's what we did:
# We figured deleting the /controller znode wouldn't make a difference, since it was already changing every 3 seconds.
# We initially tried bouncing what we had incorrectly thought was the sole broker corresponding to the metric collision, but this didn't help. The re-election loop continued, and we later found in the logs that there had been "connection-close-rate" metric collisions tagged with other brokers as well.
# We then shifted traffic out of the bad cluster, killed all of the brokers, and restarted them all. Getting the cluster back up and recovering its unflushed segments took around 25 minutes.
# We still saw under-replicated partitions after the cluster was back up, and later found that some of the fetchers hadn't been started. I think at this point we just deleted the /controller znode (see the sketch after this list) to elect a new controller, in the hope that it would broadcast partition leadership/followership to the cluster so that the brokers' fetchers could start up correctly. After this, the under-replicated partition count came back down to approximately zero. We think the few remaining under-replicated partitions come from uneven partition leadership distribution; this wasn't a big deal, so we haven't attempted to fix them yet.
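For the /controller znode deletion mentioned in the last step, here is a minimal sketch using the same org.I0Itec.zkclient ZkClient library that appears in the stack traces below. The connect string is a placeholder, and this is not necessarily the exact procedure used on this cluster:

{code}
import org.I0Itec.zkclient.ZkClient

// Minimal sketch: trigger a controller re-election by deleting the ephemeral
// /controller znode. The connect string is a placeholder, not the cluster
// from this report.
object ForceControllerReelection {
  def main(args: Array[String]): Unit = {
    val zk = new ZkClient("zookeeper.example.com:2181") // placeholder
    try {
      // Deleting /controller makes the live brokers race to re-create it,
      // which elects a new controller.
      zk.delete("/controller")
    } finally {
      zk.close()
    }
  }
}
{code}

Deleting the ephemeral /controller znode only forces the live brokers to re-run the ZooKeeper-based election; it does not clear any broker-local state.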
> Brokers stuck in controller re-election loop after failing to register metrics
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-4900
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4900
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, core
>    Affects Versions: 0.10.1.1
>            Reporter: Nick Travers
>
> We hit this today in one of our three-node staging clusters. The exception continues to occur on all three nodes.
>
> {code}
> 2017-03-15 02:17:30,677 ERROR [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] server.ZookeeperLeaderElector - Error while electing or becoming leader on broker 9
> java.lang.IllegalArgumentException: A metric named 'MetricName [name=connection-close-rate, group=controller-channel-metrics,description=Connections closed per second in the window., tags={broker-id=10}]' already exists, can't register another one.
>   at org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
>   at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
>   at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
>   at org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
>   at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
>   at kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
>   at kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
>   at kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
>   at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
>   at kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
>   at kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
>   at kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
>   at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
>   at kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
>   at kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
>   at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
>   at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
>   at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
>   at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
>   at kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
>   at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
>   at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {code}
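For context on the exception above: the client metrics registry rejects a second metric with an identical MetricName (same name, group, and tags). A minimal, self-contained sketch, assuming the org.apache.kafka.common.metrics API from the 0.10.x clients jar; the object and metric names below are illustrative, not the controller's actual registration path:

{code}
import org.apache.kafka.common.metrics.{Measurable, MetricConfig, Metrics}

// Minimal sketch: registering the same MetricName twice in one Metrics
// registry throws IllegalArgumentException.
object MetricCollisionSketch {
  def main(args: Array[String]): Unit = {
    val metrics = new Metrics()
    val name = metrics.metricName(
      "connection-close-rate",
      "controller-channel-metrics",
      "Connections closed per second in the window.")
    val zero = new Measurable {
      override def measure(config: MetricConfig, now: Long): Double = 0.0
    }
    metrics.addMetric(name, zero)
    // The second registration throws:
    // java.lang.IllegalArgumentException: A metric named '...' already exists,
    // can't register another one.
    metrics.addMetric(name, zero)
  }
}
{code}

In the quoted trace, the failing registration happens while ControllerChannelManager builds a Selector for each broker, so a metric with the same broker-id tag was evidently still registered at that point.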
> We observe a tight loop of controller (re-)election, i.e. one node hits this exception, and leadership transitions to the next, which then hits the exception, ad infinitum.
>
> Producers and consumers appear to be connecting ok, and are able to produce and consume messages.
>
> Relevant data points:
> - prior to this cluster restart, a partition reassignment was attempted for a number of topics, which appeared to get stuck in the "in progress" state (on the order of days)
> - these topics were subsequently deleted
> - a rolling restart of the cluster was performed to turn on broker-to-broker SSL communication
> - the SSL change has subsequently been _rolled back_ after we observed these exceptions
> - the entire cluster was shut down, and nodes brought back one at a time in an attempt to clear the exception. We were able to restart the cluster, but we continue to see the exceptions.
>
> We also observed, during the same time as the exception above, the following exception on all hosts:
>
> {code}
> 2017-03-15 01:44:04,572 ERROR [ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa] controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener on Controller 10]: Error while handling broker changes
> java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
>   at kafka.controller.KafkaController$$anonfun$8$$anonfun$apply$2.apply(KafkaController.scala:436)
>   at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
>   at scala.collection.immutable.List.exists(List.scala:84)
>   at kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:436)
>   at kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:435)
>   at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
>   at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>   at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
>   at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
>   at kafka.controller.KafkaController.onBrokerStartup(KafkaController.scala:435)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:374)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358)
>   at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:357)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356)
>   at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:355)
>   at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:843)
>   at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {code}
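For context on the ClassCastException above: a failure at BoxesRunTime.unboxToInt inside exists(...) is the usual runtime signature of type erasure, where a String ends up in a collection the compiler treats as holding Ints and the cast only fails when an element is unboxed inside the predicate. A minimal sketch of the mechanism only; this is not Kafka's code and not a claim about where the String came from on this cluster:

{code}
// Minimal sketch (not Kafka's code): erasure lets a String sit in a value
// typed as List[Int]; the cast only fails when an element is unboxed inside
// a predicate, matching the unboxToInt / List.exists frames above.
object UnboxCastSketch {
  def main(args: Array[String]): Unit = {
    val raw: List[Any] = List("10")           // e.g. an id that arrived as a String
    val ids = raw.asInstanceOf[List[Int]]     // no runtime check due to erasure
    ids.exists(_ == 10)                       // ClassCastException: String cannot be cast to Integer
  }
}
{code}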