Mailing-List: contact jira-help@kafka.apache.org; run by ezmlm
Precedence: bulk
Reply-To: jira@kafka.apache.org
Date: Mon, 4 Dec 2017 17:47:00 +0000 (UTC)
From: "Ramnatthan Alagappan (JIRA)" <jira@apache.org>
To: jira@kafka.apache.org
Message-ID: <JIRA.12677618.1383675825000.380230.1512409620522@Atlassian.JIRA>
In-Reply-To: <JIRA.12677618.1383675825000@Atlassian.JIRA>
References: <JIRA.12677618.1383675825000@Atlassian.JIRA> <JIRA.12677618.1383675825685@jira-lw-us.apache.org>
Subject: [jira] [Comment Edited] (KAFKA-1120) Controller could miss a broker
 state change
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Mon, 04 Dec 2017 17:47:06 -0000


    [ https://issues.apache.org/jira/browse/KAFKA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277146#comment-16277146 ] 

Ramnatthan Alagappan edited comment on KAFKA-1120 at 12/4/17 5:46 PM:
----------------------------------------------------------------------

I ran into this issue and have a setup that reproduces it regardless of the number of partitions or nodes. [~onurkaraman]'s analysis in [#comment-16113645] is correct. The root cause is that the broker shuts down, restarts, and re-registers with ZK within a short interval. When the broker shuts down, ZK delivers a callback for the deletion of the broker's node. Before ZKClient can re-establish the watch (by issuing a stat call), the broker registers with ZK again. By the time ZKClient reads the children of /brokers/ids from ZK, the restarted broker is already listed there. As a result, the restarted broker appears in both curBrokerIds and liveOrShuttingDownBrokerIds, so newBrokerIds is empty and the controller never handles the restarted broker as a new broker.
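
To make the last step concrete, here is a minimal sketch of the bookkeeping in Python rather than the controller's own code. The names curBrokerIds, liveOrShuttingDownBrokerIds and newBrokerIds come from the comment above; the function `diff_brokers`, the `dead_broker_ids` value, and the assumption that both results are plain set differences are illustrative only.

```python
def diff_brokers(cur_broker_ids, live_or_shutting_down_broker_ids):
    """Return (new, dead) broker ids, assuming both are plain set differences.

    cur_broker_ids: ids currently listed under /brokers/ids
    live_or_shutting_down_broker_ids: the controller's cached view
    """
    new_broker_ids = cur_broker_ids - live_or_shutting_down_broker_ids
    dead_broker_ids = live_or_shutting_down_broker_ids - cur_broker_ids
    return new_broker_ids, dead_broker_ids


# Controller's cached view before broker 2 bounces (hypothetical ids).
cached = {1, 2, 3}

# Fast restart: broker 2 has already re-registered when the children are read,
# so the listing is identical to the cached view and nothing looks new or dead.
new_fast, dead_fast = diff_brokers({1, 2, 3}, cached)

# Slow restart: the children are read while broker 2 is still absent,
# so the controller at least notices the broker going away.
new_slow, dead_slow = diff_brokers({1, 3}, cached)
```

In the fast-restart case both sets come back empty, which matches the behaviour described above: the controller has no way to tell that broker 2 ever left.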


was (Author: ramanala):
I ran into this issue and have a reproducible setup irrespective of the number of partitions or nodes. [~onurkaraman]'s analysis in comment @  [#comment-16113645] is correct. The root cause is that the shutdown broker restarts and registers with ZK in a short interval of time. During this time, ZK delivers a callback for deletion of the broker. Before ZKClient can reestablish the callback (by issuing a stat call), the broker registers with ZK. By the time ZKClient gets the /brokers/ids node from ZK, the shutdown broker also appears in /brokers/ids. With this, the shutdown broker appears both in curBrokerIds and liveOrShuttingDownBrokerIds, causing newBrokerIds to be empty, which causes this problem. 

> Controller could miss a broker state change 
> --------------------------------------------
>
>                 Key: KAFKA-1120
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1120
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: core
>    Affects Versions: 0.8.1
>            Reporter: Jun Rao
>            Assignee: Mickael Maison
>              Labels: reliability
>             Fix For: 1.1.0
>
>
> When the controller is in the middle of processing a task (e.g., preferred leader election, broker change), it holds a controller lock. During this time, a broker could have de-registered and re-registered itself in ZK. After the controller finishes processing the current task, it will start processing the logic in the broker change listener. However, it will see no broker change and therefore won't do anything to the restarted broker. This broker will be in a weird state since the controller doesn't inform it to become the leader of any partition. Yet, the cached metadata in other brokers could still list that broker as the leader for some partitions. Client requests routed to that broker will then get a TopicOrPartitionNotExistException. This broker will continue to be in this bad state until it's restarted again.
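
The sequence in the description can be modelled with a toy one-shot watch. This is a sketch under stated assumptions, not ZooKeeper's or Kafka's API: `FakeZk`, `run_bounce` and all of their methods are hypothetical, and the only behaviour modelled is that a child watch fires once and must be re-armed by reading the children again.

```python
class FakeZk:
    """Toy stand-in for a one-shot child watch on /brokers/ids (hypothetical)."""

    def __init__(self, children):
        self.children = set(children)
        self.watch_armed = False
        self.pending_events = 0

    def get_children_and_watch(self):
        # Reading the children re-arms the watch.
        self.watch_armed = True
        return set(self.children)

    def _changed(self):
        # A one-shot watch fires at most once until it is re-armed.
        if self.watch_armed:
            self.watch_armed = False
            self.pending_events += 1

    def deregister(self, broker_id):
        self.children.discard(broker_id)
        self._changed()

    def register(self, broker_id):
        self.children.add(broker_id)
        self._changed()


def run_bounce(zk, broker_id):
    """Bounce a broker while the controller is busy, then drain pending events."""
    live = zk.get_children_and_watch()
    zk.deregister(broker_id)  # fires the watch
    zk.register(broker_id)    # watch not re-armed yet, so no second event
    handled = []
    while zk.pending_events:  # controller is free again and runs the listener
        zk.pending_events -= 1
        cur = zk.get_children_and_watch()
        handled.append((cur - live, live - cur))  # (new ids, dead ids)
        live = cur
    return handled


events = run_bounce(FakeZk({1, 2, 3}), 2)
```

Under these assumptions the listener runs exactly once and sees neither a new nor a dead broker, so the restarted broker is never told to become leader for any partition.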

