kafka-jira mailing list archives

From "Edoardo Comar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-1120) Controller could miss a broker state change
Date Tue, 28 Nov 2017 13:42:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268731#comment-16268731 ]

Edoardo Comar commented on KAFKA-1120:

Interestingly, using [~wushujames]'s script from [#comment-16110002] on a development laptop
running trunk code:
* with the suggested 2x5000 partitions, 2x replicated, the cluster is unstable: after sitting
idle in a steady state for some 5-10 minutes, one or two of the brokers get disconnected
from ZooKeeper, then reconnect and start a bounce in which one or the other falls out of sync
* with a lower number of partitions (e.g. 2500, 3500) the above instability doesn't show, and
either a controlled shutdown with a short timeout or an ungraceful kill, followed by a broker
restart, gets the cluster back in sync without issues

> Controller could miss a broker state change 
> --------------------------------------------
>                 Key: KAFKA-1120
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1120
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: core
>    Affects Versions: 0.8.1
>            Reporter: Jun Rao
>            Assignee: Mickael Maison
>              Labels: reliability
>             Fix For: 1.1.0
> When the controller is in the middle of processing a task (e.g., preferred leader election,
> broker change), it holds a controller lock. During this time, a broker could have de-registered
> and re-registered itself in ZK. After the controller finishes processing the current task,
> it will start processing the logic in the broker change listener. However, it will see no
> broker change and therefore won't do anything to the restarted broker. This broker will be
> in a weird state since the controller doesn't inform it to become the leader of any partition.
> Yet, the cached metadata in other brokers could still list that broker as the leader for some
> partitions. Client requests routed to that broker will then get a TopicOrPartitionNotExistException.
> This broker will continue to be in this bad state until it's restarted again.
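
The missed state change described above can be illustrated with a small, self-contained
sketch. It is not Kafka's controller code: the Registry class, the per-registration "epoch"
counter and both diff functions are hypothetical names introduced here, standing in for the
broker registrations in ZK and the controller's cached view. It only shows why comparing sets
of broker ids cannot see a broker that de-registered and re-registered while the controller
was busy, whereas comparing some per-registration identifier can.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Hypothetical stand-in for broker registrations in ZK: broker id -> epoch."""
    brokers: dict = field(default_factory=dict)
    next_epoch: int = 1

    def register(self, broker_id):
        # Every (re-)registration gets a fresh, increasing epoch (an assumption).
        self.brokers[broker_id] = self.next_epoch
        self.next_epoch += 1

    def deregister(self, broker_id):
        del self.brokers[broker_id]


def diff_by_id(cached, current):
    """Broker ids that were added or removed; a bounced broker is invisible here."""
    return sorted(set(cached) ^ set(current))


def diff_by_epoch(cached, current):
    """Broker ids that were added, removed, or re-registered with a new epoch."""
    ids = set(cached) | set(current)
    return sorted(b for b in ids if cached.get(b) != current.get(b))


registry = Registry()
registry.register(1)
registry.register(2)

# The controller snapshots its view, then is busy holding the controller lock.
cached = dict(registry.brokers)

# Meanwhile broker 2 de-registers and re-registers itself.
registry.deregister(2)
registry.register(2)

# When the controller finally runs its broker-change logic:
missed = diff_by_id(cached, registry.brokers)       # sees no change
detected = diff_by_epoch(cached, registry.brokers)  # sees that broker 2 bounced
print("by id:", missed, "by epoch:", detected)
```

In this sketch the id-only comparison reports nothing for the bounced broker, which matches
the symptom in the description; whether and how the real controller tracks a per-registration
identifier is outside what this message states.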

This message was sent by Atlassian JIRA