Date: Mon, 26 Mar 2018 13:24:00 +0000 (UTC)
From: "Uwe Eisele (JIRA)"
To: jira@kafka.apache.org
Subject: [jira] [Created] (KAFKA-6715) Leader transition for all partitions lead by two brokers without visible reason

Uwe Eisele created KAFKA-6715:
---------------------------------

             Summary: Leader transition for all partitions lead by two brokers without visible reason
                 Key: KAFKA-6715
                 URL: https://issues.apache.org/jira/browse/KAFKA-6715
             Project: Kafka
          Issue Type: Bug
          Components: core, replication
    Affects Versions: 0.11.0.2
         Environment: Kafka cluster on Amazon AWS EC2 r4.2xlarge instances with 5 nodes, and a Zookeeper cluster on r4.2xlarge instances with 3 nodes. The cluster is distributed across 2 availability zones.
            Reporter: Uwe Eisele

In our cluster we experienced a situation in which leadership of all partitions led by two brokers was moved, mainly to one other broker. We don't know why this happened. At that time there was no broker outage, nor had a broker shutdown been initiated.
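For reference, this is a minimal sketch of how we count the current partition leaders per broker with the Java AdminClient shipped with 0.11 (the bootstrap address and the class name are placeholders, not part of our setup):

{code:java}
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class LeaderDistribution {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // placeholder bootstrap address
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics =
                admin.describeTopics(admin.listTopics().names().get()).all().get();
            // count how many partitions each broker currently leads
            Map<Integer, Integer> leadersPerBroker = new TreeMap<>();
            for (TopicDescription td : topics.values()) {
                for (TopicPartitionInfo p : td.partitions()) {
                    if (p.leader() != null) {
                        leadersPerBroker.merge(p.leader().id(), 1, Integer::sum);
                    }
                }
            }
            leadersPerBroker.forEach((broker, count) ->
                System.out.println("broker " + broker + " leads " + count + " partitions"));
        }
    }
}
{code}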
The Zookeeper nodes of the affected brokers (/brokers/ids/3, /brokers/ids/4) have not been modified during this time. In addition, there are no logs that would indicate a leader transition for the affected brokers. We would expect to see a "{{sending become-leader LeaderAndIsr request}}" entry in the controller log for each partition, as well as a "{{completed LeaderAndIsr request}}" entry in the state change log of the Kafka brokers that become the new leader and followers. Our log level for kafka.controller and the state change log is set to TRACE. Though all brokers are running, the situation does not recover. The cluster sticks in a highly imbalanced leader distribution, in which two brokers are not the leader of any partition and one broker is the leader of almost all partitions.

{code:java}
kafka-controller Log (Level TRACE):
[2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for broker 5 is 0.0 (kafka.controller.KafkaController)
[2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for broker 1 is 0.0 (kafka.controller.KafkaController)
[2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for broker 2 is 0.0 (kafka.controller.KafkaController)
[2018-03-19 17:03:54,043] TRACE [Controller 3]: Leader imbalance ratio for broker 3 is 0.0 (kafka.controller.KafkaController)
[2018-03-19 17:03:54,043] TRACE [Controller 3]: Leader imbalance ratio for broker 4 is 0.0 (kafka.controller.KafkaController)
...
[2018-03-19 17:08:54,049] TRACE [Controller 3]: Leader imbalance ratio for broker 5 is 0.8054794520547945 (kafka.controller.KafkaController)
[2018-03-19 17:08:54,050] TRACE [Controller 3]: Leader imbalance ratio for broker 1 is 0.0 (kafka.controller.KafkaController)
[2018-03-19 17:08:54,050] TRACE [Controller 3]: Leader imbalance ratio for broker 2 is 0.4807692307692308 (kafka.controller.KafkaController)
[2018-03-19 17:08:54,051] TRACE [Controller 3]: Leader imbalance ratio for broker 3 is 1.0 (kafka.controller.KafkaController)
[2018-03-19 17:08:54,053] TRACE [Controller 3]: Leader imbalance ratio for broker 4 is 1.0 (kafka.controller.KafkaController)
...
[2018-03-19 17:23:54,080] TRACE [Controller 3]: Leader imbalance ratio for broker 5 is 0.8054794520547945 (kafka.controller.KafkaController)
[2018-03-19 17:23:54,081] TRACE [Controller 3]: Leader imbalance ratio for broker 1 is 0.0 (kafka.controller.KafkaController)
[2018-03-19 17:23:54,081] TRACE [Controller 3]: Leader imbalance ratio for broker 2 is 0.4807692307692308 (kafka.controller.KafkaController)
[2018-03-19 17:23:54,082] TRACE [Controller 3]: Leader imbalance ratio for broker 3 is 1.0 (kafka.controller.KafkaController)
[2018-03-19 17:23:54,084] TRACE [Controller 3]: Leader imbalance ratio for broker 4 is 1.0 (kafka.controller.KafkaController)
{code}

The imbalance was recognized by the controller, but nothing happened. In addition, it seems that the ReplicaFetcherThreads died without any log message, though we think this should not be possible... We would expect log messages stating that fetchers for partitions have been removed, as well as that the ReplicaFetcherThreads are shutting down. The log level for _kafka_ is set to INFO. In other situations, when a broker is shut down, we do see such entries in the log files. Besides that, this caused under-replicated partitions. It seems that no broker fetches from the partitions with the newly assigned leaders. As with the highly imbalanced leader distribution, the cluster sticks in this state and does not recover.
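Likewise, a minimal sketch of how the broker-level under-replicated partition gauge and the live ReplicaFetcherThreads can be inspected over JMX (the host, port and class name are placeholders; JMX must be enabled on the broker):

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerHealthCheck {
    public static void main(String[] args) throws Exception {
        // placeholder JMX host/port of the broker under suspicion
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-3:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // broker-level count of under-replicated partitions
            ObjectName urp = new ObjectName(
                "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
            System.out.println("UnderReplicatedPartitions = " + mbsc.getAttribute(urp, "Value"));

            // list live ReplicaFetcherThreads; an empty list would support the suspicion
            // that the fetchers died without logging anything
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                mbsc, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
                if (info != null && info.getThreadName().contains("ReplicaFetcherThread")) {
                    System.out.println(info.getThreadName() + " state=" + info.getThreadState());
                }
            }
        }
    }
}
{code}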
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)