Mailing-List: contact dev-help@kafka.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@kafka.apache.org
Date: Thu, 31 Aug 2017 01:12:00 +0000 (UTC)
From: "Allen Wang (JIRA)" <jira@apache.org>
To: dev@kafka.apache.org
Message-ID: <JIRA.13098702.1504141897000.165748.1504141920447@Atlassian.JIRA>
In-Reply-To: <JIRA.13098702.1504141897000@Atlassian.JIRA>
References: <JIRA.13098702.1504141897000@Atlassian.JIRA> <JIRA.13098702.1504141897056@jira-lw-us.apache.org>
Subject: [jira] [Created] (KAFKA-5813) Unexpected unclean leader election
 due to leader/controller's unusual event handling order
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Thu, 31 Aug 2017 01:12:10 -0000

Allen Wang created KAFKA-5813:
---------------------------------

             Summary: Unexpected unclean leader election due to leader/controller's unusual event handling order 
                 Key: KAFKA-5813
                 URL: https://issues.apache.org/jira/browse/KAFKA-5813
             Project: Kafka
          Issue Type: Improvement
    Affects Versions: 0.10.2.1
            Reporter: Allen Wang
            Priority: Minor


We experienced an unexpected unclean leader election after a network glitch on the leader of a partition. We use a replication factor of 2.

Here is the sequence of events, gathered from various logs:

1. The ZK session of the partition leader times out
2. The leader establishes a new ZK session
3. The leader removes the follower from the ISR (possibly because of replication delay caused by the network problem) and updates the ISR in ZK
4. The controller processes the BrokerChangeListener event triggered at step 1, in which the leader appears to be offline
5. Because the leader has already updated the ISR in ZK to remove the follower, the controller performs an unclean leader election
6. The controller processes the second BrokerChangeListener event, triggered at step 2, and marks the broker online again

It seems to me that step 4 happens too late. If it happened right after step 1, the follower would still be in the ISR, so the election would be clean. Hopefully the producer would then switch to the new leader immediately, avoiding a consumer offset reset.
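
To make the ordering issue concrete, here is a toy model of the two orderings. It is only a sketch, not Kafka code: the function names and the broker labels "leader" and "follower" are hypothetical, and it assumes unclean leader election is enabled and that the controller sees only the follower as live when it handles the step 1 event.

```python
# Toy model of the event ordering described above; not Kafka code.
# All names here are hypothetical and exist only for illustration.

def elect_leader(isr, live_brokers, replicas):
    """Prefer a live ISR member (clean); otherwise fall back to any
    live replica (unclean, assuming unclean election is enabled)."""
    for broker in isr:
        if broker in live_brokers:
            return broker, "clean"
    for broker in replicas:
        if broker in live_brokers:
            return broker, "unclean"
    return None, "offline"


def run(controller_handles_offline_before_isr_shrink):
    replicas = ["leader", "follower"]  # replication factor 2
    isr = ["leader", "follower"]       # ISR stored in ZK before the glitch
    live = {"follower"}                # controller's view for the step 1 event

    if not controller_handles_offline_before_isr_shrink:
        # Reported order: step 3 (leader shrinks the ISR in ZK)
        # happens before step 4 (controller handles the offline event).
        isr = ["leader"]
    return elect_leader(isr, live, replicas)


print(run(True))   # step 4 right after step 1
print(run(False))  # step 4 after step 3, as in the logs
```

In this model the first call yields a clean election of the follower, while the second yields an unclean one, because the only remaining ISR member is the broker the controller believes to be offline.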


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)