Date: Thu, 31 Aug 2017 01:12:00 +0000 (UTC)
From: "Allen Wang (JIRA)"
To: dev@kafka.apache.org
Subject: [jira] [Created] (KAFKA-5813) Unexpected unclean leader election due to leader/controller's unusual event handling order

Allen Wang created KAFKA-5813:
---------------------------------

             Summary: Unexpected unclean leader election due to leader/controller's unusual event handling order
                 Key: KAFKA-5813
                 URL: https://issues.apache.org/jira/browse/KAFKA-5813
             Project: Kafka
          Issue Type: Improvement
    Affects Versions: 0.10.2.1
            Reporter: Allen Wang
            Priority: Minor

We experienced an unexpected unclean leader election after a network glitch on the leader of a partition. We have a replication factor of 2. Here is the sequence of events, gathered from various logs:

1. A ZK session timeout happens for the leader of the partition.
2. A new ZK session is established for the leader.
3. The leader removes the follower from the ISR (possibly because of replication delay caused by the network problem) and updates the ISR in ZK.
4. The controller processes the BrokerChangeListener event from step 1, in which the leader appears to be offline.
5. Because the ISR in ZK has already been updated by the leader to remove the follower, the controller performs an unclean leader election.
6. The controller processes the second BrokerChangeListener event, from step 2, and marks the broker online again.

It seems to me that step 4 happens too late. If it happened right after step 1, the leader election would be clean and, hopefully, the producer would immediately switch to the new leader, thus avoiding a consumer offset reset.
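
The effect of the ordering can be sketched with a small toy model. This is not Kafka's controller code: the Partition class, the event names, and the elect_leader and run helpers are all hypothetical, and the model assumes that unclean leader election is enabled and that the controller prefers a live ISR member before falling back to any live assigned replica. Broker 1 stands for the leader and broker 2 for the follower.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    replicas: list  # assigned replicas, in preference order
    isr: list       # in-sync replicas as currently stored in ZK


def elect_leader(partition, live_brokers):
    """Toy election: return (new_leader, is_clean).

    Assumption: prefer a live ISR member (clean election); otherwise fall
    back to any live assigned replica (unclean election).
    """
    live_isr = [b for b in partition.isr if b in live_brokers]
    if live_isr:
        return live_isr[0], True
    live_replicas = [b for b in partition.replicas if b in live_brokers]
    if live_replicas:
        return live_replicas[0], False
    return None, False


def run(events):
    """Apply hypothetical events in order; return the election outcome."""
    partition = Partition(replicas=[1, 2], isr=[1, 2])
    controller_live_view = {1, 2}  # brokers the controller believes are up
    outcome = None
    for event in events:
        if event == "leader_shrinks_isr":
            # Step 3: leader (broker 1) drops the follower from the ISR.
            partition.isr = [1]
        elif event == "controller_sees_leader_offline":
            # Step 4: controller handles the session-expiry notification.
            controller_live_view = controller_live_view - {1}
            outcome = elect_leader(partition, controller_live_view)
        elif event == "controller_sees_leader_online":
            # Step 6: controller handles the new-session notification.
            controller_live_view = controller_live_view | {1}
    return outcome


# Order reported in this issue: the ISR shrink lands before the controller
# handles the "offline" event, so no live ISR member remains.
reported = run([
    "leader_shrinks_isr",
    "controller_sees_leader_offline",
    "controller_sees_leader_online",
])

# Order suggested in this issue: the controller handles the "offline"
# event right after step 1, while the follower is still in the ISR.
suggested = run([
    "controller_sees_leader_offline",
    "controller_sees_leader_online",
])

print("reported order :", reported)
print("suggested order:", suggested)
```

In this model the new leader is broker 2 in both runs. Only the reported order forces the unclean path, because by the time the controller acts, the only ISR member left is the broker it believes to be offline.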