Date: Mon, 11 Jun 2018 20:22:00 +0000 (UTC)
From: "Lucas Wang (JIRA)" <jira@apache.org>
To: dev@kafka.apache.org
Message-ID: <JIRA.13165408.1528748501000.148344.1528748520090@Atlassian.JIRA>
In-Reply-To: <JIRA.13165408.1528748501000@Atlassian.JIRA>
References: <JIRA.13165408.1528748501000@Atlassian.JIRA> <JIRA.13165408.1528748501549@jira-lw-us.apache.org>
Subject: [jira] [Created] (KAFKA-7040) The replica fetcher thread may
 truncate accepted messages during multiple fast leadership transitions

Lucas Wang created KAFKA-7040:
---------------------------------

             Summary: The replica fetcher thread may truncate accepted messages during multiple fast leadership transitions
                 Key: KAFKA-7040
                 URL: https://issues.apache.org/jira/browse/KAFKA-7040
             Project: Kafka
          Issue Type: Bug
            Reporter: Lucas Wang


Problem Statement:
Consider a scenario with two brokers, broker0 and broker1, and two partitions, "t1p0" and "t1p1"[1], both of which have broker1 as the leader and broker0 as the follower. The following sequence of events occurs on broker0:

1. The replica fetcher thread on broker0 issues a LeaderEpoch request to broker1 and waits for the response.
2. A LeaderAndISR request causes broker0 to become the leader for partition t1p0, which in turn removes t1p0 from the replica fetcher thread.
3. Broker0 accepts some messages from a producer.
4. A second LeaderAndISR request causes broker1 to become the leader and broker0 to become the follower for partition t1p0. This adds t1p0 back to the replica fetcher thread on broker0.
5. The replica fetcher thread on broker0 receives the response to the LeaderEpoch request issued in step 1 and truncates the messages accepted in step 3.
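
The five steps above can be sketched as a small single-threaded toy model. This is not Kafka code: `FollowerLog`, `simulate_race`, the message names, and the `leader_end_offset` field are hypothetical names introduced only for illustration. The sketch also assumes that the response to the request from step 1 carries the end offset the log had at that moment, and that the fetcher only checks whether the partition is currently in its set before truncating.

```python
from dataclasses import dataclass, field


@dataclass
class FollowerLog:
    """Toy stand-in for broker0's local log of partition t1p0 (hypothetical)."""
    messages: list = field(default_factory=list)

    @property
    def end_offset(self) -> int:
        return len(self.messages)

    def append(self, msg: str) -> None:
        self.messages.append(msg)

    def truncate_to(self, offset: int) -> list:
        """Drop every message at or after `offset` and return what was dropped."""
        dropped = self.messages[offset:]
        del self.messages[offset:]
        return dropped


def simulate_race() -> list:
    # State before step 1: broker0 follows broker1 and has replicated two messages.
    log = FollowerLog(["m0", "m1"])
    fetcher_partitions = {"t1p0", "t1p1"}

    # Step 1: the request is sent; the answer reflects the log as of now
    # (assumption: the leader's end offset equals the follower's, i.e. 2).
    in_flight_response = {"partition": "t1p0", "leader_end_offset": log.end_offset}

    # Step 2: broker0 becomes leader, so t1p0 leaves the fetcher thread.
    fetcher_partitions.discard("t1p0")

    # Step 3: as leader, broker0 accepts messages from a producer.
    log.append("m2")
    log.append("m3")

    # Step 4: broker0 becomes follower again, so t1p0 is added back.
    fetcher_partitions.add("t1p0")

    # Step 5: the stale response arrives. The partition is in the set again,
    # so the response is applied and the messages from step 3 are truncated.
    dropped = []
    if in_flight_response["partition"] in fetcher_partitions:
        dropped = log.truncate_to(in_flight_response["leader_end_offset"])
    return dropped


if __name__ == "__main__":
    print(simulate_race())
```

In this model the membership check in step 5 cannot tell that the partition left and rejoined the fetcher between the request and its response, which is why the stale offset is still applied.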

The issue can be reproduced with the test from https://github.com/gitlw/kafka/commit/8956e743f0e432cc05648da08c81fc1167b31bea

[1] Initially we set up broker0 to be the follower of two partitions instead of just one, so that the replica fetcher thread is not shut down when it becomes idle.

