kafka-jira mailing list archives

From "Uwe Eisele (JIRA)" <j...@apache.org>
Subject [jira] [Created] (KAFKA-6715) Leader transition for all partitions led by two brokers without visible reason
Date Mon, 26 Mar 2018 13:24:00 GMT
Uwe Eisele created KAFKA-6715:
---------------------------------

             Summary: Leader transition for all partitions led by two brokers without visible reason
                 Key: KAFKA-6715
                 URL: https://issues.apache.org/jira/browse/KAFKA-6715
             Project: Kafka
          Issue Type: Bug
          Components: core, replication
    Affects Versions: 0.11.0.2
         Environment: Kafka cluster on Amazon AWS EC2 r4.2xlarge instances with 5 nodes and
a Zookeeper cluster on r4.2xlarge instances with 3 nodes. The cluster is distributed across
2 availability zones.
            Reporter: Uwe Eisele


In our cluster we experienced a situation in which the leaders of all partitions led by two
brokers were moved, mostly to one other broker.

We don't know why this happened. At that time there was no broker outage, nor had a broker shutdown
been initiated. The Zookeeper nodes of the affected brokers (/brokers/ids/3, /brokers/ids/4)
were not modified during this time.

In addition, there are no logs that would indicate a leader transition for the affected brokers.
We would expect to see a "{{sending become-leader LeaderAndIsr request}}" entry in the controller
log for each partition, as well as a "{{completed LeaderAndIsr request}}" entry in the state change
log of the Kafka brokers that become the new leader and followers. Our log level for kafka.controller
and the state change log is set to TRACE.
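
For reference, a sketch of the {{config/log4j.properties}} entries that enable these levels; the logger names are Kafka's, while the appender names below are the ones from the stock file and may differ per installation:
{code}
# Controller log
log4j.logger.kafka.controller=TRACE, controllerAppender
log4j.additivity.kafka.controller=false

# State change log
log4j.logger.state.change.logger=TRACE, stateChangeAppender
log4j.additivity.state.change.logger=false
{code}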

Though all brokers are running, the situation does not recover. The cluster is stuck in a highly
imbalanced leader distribution, in which two brokers are not the leader of any partition and one
broker is the leader of almost all partitions.
{code:java}
kafka-controller Log (Level TRACE):
[2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for broker 5 is 0.0
(kafka.controller.KafkaController)
[2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for broker 1 is 0.0
(kafka.controller.KafkaController)
[2018-03-19 17:03:54,042] TRACE [Controller 3]: Leader imbalance ratio for broker 2 is 0.0
(kafka.controller.KafkaController)
[2018-03-19 17:03:54,043] TRACE [Controller 3]: Leader imbalance ratio for broker 3 is 0.0
(kafka.controller.KafkaController)
[2018-03-19 17:03:54,043] TRACE [Controller 3]: Leader imbalance ratio for broker 4 is 0.0
(kafka.controller.KafkaController)
...
[2018-03-19 17:08:54,049] TRACE [Controller 3]: Leader imbalance ratio for broker 5 is 0.8054794520547945
(kafka.controller.KafkaController)
[2018-03-19 17:08:54,050] TRACE [Controller 3]: Leader imbalance ratio for broker 1 is 0.0
(kafka.controller.KafkaController)
[2018-03-19 17:08:54,050] TRACE [Controller 3]: Leader imbalance ratio for broker 2 is 0.4807692307692308
(kafka.controller.KafkaController)
[2018-03-19 17:08:54,051] TRACE [Controller 3]: Leader imbalance ratio for broker 3 is 1.0
(kafka.controller.KafkaController)
[2018-03-19 17:08:54,053] TRACE [Controller 3]: Leader imbalance ratio for broker 4 is 1.0
(kafka.controller.KafkaController)
...
[2018-03-19 17:23:54,080] TRACE [Controller 3]: Leader imbalance ratio for broker 5 is 0.8054794520547945
(kafka.controller.KafkaController)
[2018-03-19 17:23:54,081] TRACE [Controller 3]: Leader imbalance ratio for broker 1 is 0.0
(kafka.controller.KafkaController)
[2018-03-19 17:23:54,081] TRACE [Controller 3]: Leader imbalance ratio for broker 2 is 0.4807692307692308
(kafka.controller.KafkaController)
[2018-03-19 17:23:54,082] TRACE [Controller 3]: Leader imbalance ratio for broker 3 is 1.0
(kafka.controller.KafkaController)
[2018-03-19 17:23:54,084] TRACE [Controller 3]: Leader imbalance ratio for broker 4 is 1.0
(kafka.controller.KafkaController)
{code}
The imbalance was recognized by the controller, but nothing happened.
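
For context, a simplified Java sketch of how we understand the per-broker imbalance ratio shown in the log above (the names below are ours, not Kafka's; the actual logic is Scala code in kafka.controller.KafkaController): the ratio is the fraction of partitions for which a broker is the preferred (first-assigned) replica but not the current leader. With auto.leader.rebalance.enable=true the controller should trigger a preferred replica election once the ratio exceeds leader.imbalance.per.broker.percentage (default 10); the check interval (leader.imbalance.check.interval.seconds) defaults to 300 s, which matches the 5-minute spacing of the log lines above.
{code:java}
import java.util.List;
import java.util.Map;

// Simplified sketch of the per-broker imbalance check as we understand it.
// Class, method and parameter names are ours, not Kafka's.
public class ImbalanceSketch {

    // assignments: partition -> replica list (first entry = preferred leader)
    // currentLeader: partition -> broker id of the current leader
    static double imbalanceRatio(int brokerId,
                                 Map<String, List<Integer>> assignments,
                                 Map<String, Integer> currentLeader) {
        int preferredHere = 0;     // partitions whose preferred leader is this broker
        int notLedByPreferred = 0; // ...of those, partitions currently led by another broker
        for (Map.Entry<String, List<Integer>> e : assignments.entrySet()) {
            if (e.getValue().get(0) == brokerId) {
                preferredHere++;
                Integer leader = currentLeader.get(e.getKey());
                if (leader == null || leader != brokerId) {
                    notLedByPreferred++;
                }
            }
        }
        return preferredHere == 0 ? 0.0 : (double) notLedByPreferred / preferredHere;
    }

    // With auto.leader.rebalance.enable=true the controller is expected to trigger a
    // preferred replica election for the broker once the ratio exceeds
    // leader.imbalance.per.broker.percentage (default 10).
    static boolean needsRebalance(double ratio, int imbalancePercentage) {
        return ratio > imbalancePercentage / 100.0;
    }
}
{code}
A ratio of 1.0 for brokers 3 and 4 would then mean that none of the partitions for which they are the preferred leader are actually led by them, yet no rebalance was triggered.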

In addition, it seems that the ReplicaFetcherThreads died without any log message, though we
think this should not be possible... We would expect log messages stating that fetchers
for partitions have been removed, as well as that the ReplicaFetcherThreads are shutting down.
The log level for _kafka_ is set to INFO. In other situations, when a broker is shut down,
we do see such entries in the log files.
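
As a diagnostic independent of the logs, a broker's replica fetcher threads can be listed directly over JMX. A minimal sketch, assuming the broker JVM exposes JMX (e.g. via JMX_PORT) and using a placeholder host/port:
{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Lists a broker's replica fetcher threads via remote JMX. Fetcher threads are
// named "ReplicaFetcherThread-<fetcherId>-<sourceBrokerId>"; if none are listed
// for a broker that should be following partitions, the fetchers are gone.
public class FetcherThreadCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port: the broker JVM must expose JMX (e.g. JMX_PORT=9999).
        String url = "service:jmx:rmi:///jndi/rmi://broker-4:9999/jmxrmi";
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    mbs, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                if (info.getThreadName().startsWith("ReplicaFetcherThread")) {
                    System.out.println(info.getThreadName() + ": " + info.getThreadState());
                }
            }
        }
    }
}
{code}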

Besides that, this caused under-replicated partitions. It seems that no broker fetches from
the partitions with the newly assigned leaders. As with the highly imbalanced leader
distribution, the cluster is stuck in this state and does not recover.
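
The per-broker count of under-replicated partitions is also exposed as the JMX gauge {{kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions}}; a minimal sketch of reading it (placeholder host/port again):
{code:java}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Reads the UnderReplicatedPartitions gauge of a single broker over JMX.
public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port: the broker JVM must expose JMX (e.g. JMX_PORT=9999).
        String url = "service:jmx:rmi:///jndi/rmi://broker-3:9999/jmxrmi";
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            Object value = mbs.getAttribute(
                    new ObjectName("kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"),
                    "Value");
            System.out.println("UnderReplicatedPartitions = " + value);
        }
    }
}
{code}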



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
