Date: Mon, 27 Nov 2017 17:22:00 +0000 (UTC)
From: "Jun Rao (JIRA)"
To: jira@kafka.apache.org
Subject: [jira] [Commented] (KAFKA-1120) Controller could miss a broker state change

    [ https://issues.apache.org/jira/browse/KAFKA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267089#comment-16267089 ]

Jun Rao commented on KAFKA-1120:
--------------------------------

[~mimaison], yes, the issue is still not fixed in trunk. To fix this, the controller will need to track the broker's ZK session id as you said.

> Controller could miss a broker state change
> --------------------------------------------
>
>                 Key: KAFKA-1120
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1120
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: core
>    Affects Versions: 0.8.1
>            Reporter: Jun Rao
>              Labels: reliability
>             Fix For: 1.1.0
>
>
> When the controller is in the middle of processing a task (e.g., preferred leader election, broker change), it holds a controller lock. During this time, a broker could have de-registered and re-registered itself in ZK. After the controller finishes processing the current task, it will start processing the logic in the broker change listener. However, it will see no broker change and therefore won't do anything to the restarted broker.
> This broker will be in a weird state since the controller doesn't inform it to become the leader of any partition. Yet, the cached metadata in other brokers could still list that broker as the leader for some partitions. Client requests routed to that broker will then get a TopicOrPartitionNotExistException. This broker will continue to be in this bad state until it's restarted again.
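
To illustrate why comparing broker ids alone misses the bounce, and why tracking the ZK session id would catch it, here is a minimal Python sketch. It is not Kafka code: the function names, the dictionaries standing in for the controller's cache and the ZK registrations, and the session id values are all hypothetical.

```python
from typing import Dict, Set, Tuple


def diff_by_id(cached: Dict[int, int], current: Dict[int, int]) -> Tuple[Set[int], Set[int]]:
    """Compare broker ids only. Returns (new_ids, dead_ids)."""
    return set(current) - set(cached), set(cached) - set(current)


def diff_by_session(
    cached: Dict[int, int], current: Dict[int, int]
) -> Tuple[Set[int], Set[int], Set[int]]:
    """Compare broker ids and session ids. Returns (new_ids, dead_ids, bounced_ids)."""
    new_ids = set(current) - set(cached)
    dead_ids = set(cached) - set(current)
    # A broker present in both views but with a different session id
    # de-registered and re-registered while the controller was busy.
    bounced_ids = {b for b in set(cached) & set(current) if cached[b] != current[b]}
    return new_ids, dead_ids, bounced_ids


# Controller's cached view before it took the lock:
# broker id -> ZK session id (hypothetical values).
cached = {1: 0x1001, 2: 0x1002}

# Registrations read after the lock is released: broker 2 re-registered
# in the meantime, so it has the same id but a new session.
current = {1: 0x1001, 2: 0x2007}

print(diff_by_id(cached, current))       # (set(), set())
print(diff_by_session(cached, current))  # (set(), set(), {2})
```

With the id-only comparison the controller sees no change, which matches the behaviour described above. With the session id included, broker 2 shows up as bounced and could be handled as a failure followed by a startup.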