Date: Wed, 15 Mar 2017 06:27:41 +0000 (UTC)
From: "Onur Karaman (JIRA)"
To: dev@kafka.apache.org
Subject: [jira] [Commented] (KAFKA-4900) Brokers stuck in controller re-election loop after failing to register metrics

    [ https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925611#comment-15925611 ]

Onur Karaman commented on KAFKA-4900:
-------------------------------------

For reference, I think the controller was changing every 3 seconds or so during the re-election loop. We were able to stabilize the cluster after about two hours, but we haven't yet figured out the root cause. Since we didn't know the cause, we definitely didn't take the optimal sequence of steps to stabilize the cluster.
Anyway, here's what we did:
# We figured deleting the /controller znode wouldn't make a difference, since it was already changing every 3 seconds.
# We initially tried bouncing what we had incorrectly thought was the sole broker corresponding to the metric collision, but this didn't help. The re-election loop continued, and we later found in the logs that there had been "connection-close-rate" metric collisions tagged with other brokers as well.
# We then shifted traffic out of the bad cluster, killed all of the brokers, and restarted them all. Getting the cluster back up and recovering its unflushed segments took around 25 minutes.
# We still saw under-replicated partitions after the cluster was back up, and later found that some of the fetchers hadn't been started. I think at this point we just deleted the /controller znode (see the sketch after this list) to elect a new controller, in the hope that it would broadcast partition leadership/followership to the cluster so that the brokers' fetchers could start up correctly. After this, the under-replicated partition count came back down to approximately zero. We think the few remaining under-replicated partitions come from uneven partition leadership distribution; this wasn't a big deal, so we haven't attempted to fix them yet.
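For the /controller znode deletion mentioned in the last step, here is a minimal sketch using the same org.I0Itec.zkclient ZkClient library that appears in the stack traces below. The connect string is a placeholder, and this is not necessarily the exact procedure used on this cluster:

{code}
import org.I0Itec.zkclient.ZkClient

// Minimal sketch: trigger a controller re-election by deleting the ephemeral
// /controller znode. The connect string is a placeholder, not the cluster
// from this report.
object ForceControllerReelection {
  def main(args: Array[String]): Unit = {
    val zk = new ZkClient("zookeeper.example.com:2181") // placeholder
    try {
      // Deleting /controller makes the live brokers race to re-create it,
      // which elects a new controller.
      zk.delete("/controller")
    } finally {
      zk.close()
    }
  }
}
{code}

Deleting the ephemeral /controller znode only forces the live brokers to re-run the ZooKeeper-based election; it does not clear any broker-local state.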
> Brokers stuck in controller re-election loop after failing to register metrics
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-4900
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4900
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, core
>    Affects Versions: 0.10.1.1
>            Reporter: Nick Travers
>
> We hit this today in one of our three-node staging clusters. The exception continues to occur on all three nodes.
>
> {code}
> 2017-03-15 02:17:30,677 ERROR [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] server.ZookeeperLeaderElector - Error while electing or becoming leader on broker 9
> java.lang.IllegalArgumentException: A metric named 'MetricName [name=connection-close-rate, group=controller-channel-metrics,description=Connections closed per second in the window., tags={broker-id=10}]' already exists, can't register another one.
>   at org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
>   at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
>   at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
>   at org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
>   at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
>   at kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
>   at kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
>   at kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
>   at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
>   at kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
>   at kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
>   at kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
>   at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
>   at kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
>   at kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
>   at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
>   at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
>   at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
>   at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
>   at kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
>   at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
>   at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {code}
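For context on the exception above: the client metrics registry rejects a second metric with an identical MetricName (same name, group, and tags). A minimal, self-contained sketch, assuming the org.apache.kafka.common.metrics API from the 0.10.x clients jar; the object and metric names below are illustrative, not the controller's actual registration path:

{code}
import org.apache.kafka.common.metrics.{Measurable, MetricConfig, Metrics}

// Minimal sketch: registering the same MetricName twice in one Metrics
// registry throws IllegalArgumentException.
object MetricCollisionSketch {
  def main(args: Array[String]): Unit = {
    val metrics = new Metrics()
    val name = metrics.metricName(
      "connection-close-rate",
      "controller-channel-metrics",
      "Connections closed per second in the window.")
    val zero = new Measurable {
      override def measure(config: MetricConfig, now: Long): Double = 0.0
    }
    metrics.addMetric(name, zero)
    // The second registration throws:
    // java.lang.IllegalArgumentException: A metric named '...' already exists,
    // can't register another one.
    metrics.addMetric(name, zero)
  }
}
{code}

In the quoted trace, the failing registration happens while ControllerChannelManager builds a Selector for each broker, so a metric with the same broker-id tag was evidently still registered at that point.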
> We observe a tight loop of controller (re-)election, i.e. one node hits this exception, and leadership transitions to the next, which then hits the exception, ad infinitum.
>
> Producers and consumers appear to be connecting ok, and are able to produce and consume messages.
>
> Relevant data points:
> - prior to this cluster restart, a partition reassignment was attempted for a number of topics, which appeared to get stuck in the "in progress" state (on the order of days)
> - these topics were subsequently deleted
> - a rolling restart of the cluster was performed to turn on broker-to-broker SSL communication
> - the SSL change has subsequently been _rolled back_ after we observed these exceptions
> - the entire cluster was shut down, and nodes brought back one at a time in an attempt to clear the exception. We were able to restart the cluster, but we continue to see the exceptions.
>
> We also observed, during the same time as the exception above, the following exception on all hosts:
>
> {code}
> 2017-03-15 01:44:04,572 ERROR [ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa] controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener on Controller 10]: Error while handling broker changes
> java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
>   at kafka.controller.KafkaController$$anonfun$8$$anonfun$apply$2.apply(KafkaController.scala:436)
>   at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
>   at scala.collection.immutable.List.exists(List.scala:84)
>   at kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:436)
>   at kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:435)
>   at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
>   at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>   at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
>   at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
>   at kafka.controller.KafkaController.onBrokerStartup(KafkaController.scala:435)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:374)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358)
>   at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:357)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356)
>   at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
>   at kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:355)
>   at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:843)
>   at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {code}
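For context on the ClassCastException above: a failure at BoxesRunTime.unboxToInt inside exists(...) is the usual runtime signature of type erasure, where a String ends up in a collection the compiler treats as holding Ints and the cast only fails when an element is unboxed inside the predicate. A minimal sketch of the mechanism only; this is not Kafka's code and not a claim about where the String came from on this cluster:

{code}
// Minimal sketch (not Kafka's code): erasure lets a String sit in a value
// typed as List[Int]; the cast only fails when an element is unboxed inside
// a predicate, matching the unboxToInt / List.exists frames above.
object UnboxCastSketch {
  def main(args: Array[String]): Unit = {
    val raw: List[Any] = List("10")           // e.g. an id that arrived as a String
    val ids = raw.asInstanceOf[List[Int]]     // no runtime check due to erasure
    ids.exists(_ == 10)                       // ClassCastException: String cannot be cast to Integer
  }
}
{code}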