From dev-return-107415-archive-asf-public=cust-asf.ponee.io@kafka.apache.org  Thu Sep 12 04:24:03 2019
Return-Path: <dev-return-107415-archive-asf-public=cust-asf.ponee.io@kafka.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [207.244.88.153])
	by mx-eu-01.ponee.io (Postfix) with SMTP id 627EC18063F
	for <archive-asf-public@cust-asf.ponee.io>; Thu, 12 Sep 2019 06:24:03 +0200 (CEST)
Received: (qmail 67950 invoked by uid 500); 12 Sep 2019 04:24:01 -0000
Mailing-List: contact dev-help@kafka.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:dev-help@kafka.apache.org>
List-Unsubscribe: <mailto:dev-unsubscribe@kafka.apache.org>
List-Post: <mailto:dev@kafka.apache.org>
List-Id: <dev.kafka.apache.org>
Reply-To: dev@kafka.apache.org
Delivered-To: mailing list dev@kafka.apache.org
Received: (qmail 67924 invoked by uid 99); 12 Sep 2019 04:24:01 -0000
Received: from mailrelay1-us-west.apache.org (HELO mailrelay1-us-west.apache.org) (209.188.14.139)
    by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 12 Sep 2019 04:24:01 +0000
Received: from jira-he-de.apache.org (static.172.67.40.188.clients.your-server.de [188.40.67.172])
	by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 92FBCE2C65
	for <dev@kafka.apache.org>; Thu, 12 Sep 2019 04:24:00 +0000 (UTC)
Received: from jira-he-de.apache.org (localhost.localdomain [127.0.0.1])
	by jira-he-de.apache.org (ASF Mail Server at jira-he-de.apache.org) with ESMTP id 0E5557803DB
	for <dev@kafka.apache.org>; Thu, 12 Sep 2019 04:24:00 +0000 (UTC)
Date: Thu, 12 Sep 2019 04:24:00 +0000 (UTC)
From: "Luke Stephenson (Jira)" <jira@apache.org>
To: dev@kafka.apache.org
Message-ID: <JIRA.13256258.1568262191000.49178.1568262240057@Atlassian.JIRA>
In-Reply-To: <JIRA.13256258.1568262191000@Atlassian.JIRA>
References: <JIRA.13256258.1568262191000@Atlassian.JIRA> <JIRA.13256258.1568262191803@jira-he-de>
Subject: [jira] [Created] (KAFKA-8900) Stalled partitions
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

Luke Stephenson created KAFKA-8900:
--------------------------------------

             Summary: Stalled partitions
                 Key: KAFKA-8900
                 URL: https://issues.apache.org/jira/browse/KAFKA-8900
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 2.1.1
            Reporter: Luke Stephenson


I'm seeing behaviour where a Scala KafkaConsumer has stalled on one partition of a topic.  All other partitions of that topic are being consumed successfully.

Restarting the consumer process does not resolve the issue.  The consumer is using version 2.3.0 ("org.apache.kafka" % "kafka-clients" % "2.3.0").

When the consumer starts, I see that it is assigned the partition.  However, it then logs:
{code}
[Consumer clientId=kafka-bus-router-64c88855cf-hxck7.event-bus-router-consumer.1d1ed7ee-5038-4441-84eb-8080ac130e9a, groupId=event-bus-router] Setting offset for partition maxwell.transactions-22 to the committed offset FetchPosition{offset=275413397, offsetEpoch=Optional[271], currentLeader=LeaderAndEpoch{leader=:-1 (id: -1 rack: null), epoch=271}}
{code}

Note that the leader is logged as "-1".  If I search through my application logs for the past couple of days, the only time I ever see this logged on the consumer is for this partition.
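
To repeat that log search on other consumers, something like the following could be used. It is only a sketch: the regular expression is derived from the single log line quoted above, so other client versions may format the line differently, and the name leaderless_positions is made up for illustration.

```python
import re

# Pattern for the consumer's "Setting offset for partition ..." line.
# Assumption: the format matches the single example quoted in this report.
FETCH_POSITION = re.compile(
    r"Setting offset for partition (?P<partition>\S+) to the committed offset "
    r"FetchPosition\{offset=(?P<offset>\d+), "
    r"offsetEpoch=Optional\[(?P<offset_epoch>\d+)\], "
    r"currentLeader=LeaderAndEpoch\{leader=.*?\(id: (?P<leader_id>-?\d+) "
    r"rack: [^)]*\), epoch=(?P<leader_epoch>\d+)\}\}"
)


def leaderless_positions(lines):
    """Return one dict per log line whose currentLeader has id -1."""
    hits = []
    for line in lines:
        match = FETCH_POSITION.search(line)
        if match and int(match.group("leader_id")) == -1:
            hits.append({
                "partition": match.group("partition"),
                "offset": int(match.group("offset")),
                "leader_epoch": int(match.group("leader_epoch")),
            })
    return hits
```

Feeding it the lines of a consumer log file (for example, an open file object) returns only the entries where the leader id is -1, which makes it easy to confirm whether any other partition is affected.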

The Kafka broker is running version 2.1.1.  On the broker side the logs show:
{code}
{"timeMillis":1568087844876,"thread":"kafka-request-handler-1","level":"WARN","loggerName":"state.change.logger","message":"[Broker id=5] Ignoring LeaderAndIsr request from controller 4 with correlation id 15943 epoch 155 for partition maxwell.transactions-22 since its associated leader epoch 270 is not higher than the current leader epoch 270","endOfBatch":false,"loggerFqcn":"org.slf4j.impl.Log4jLoggerAdapter","threadId":72,"threadPriority":5}
{"timeMillis":1568087844880,"thread":"kafka-request-handler-1","level":"INFO","loggerName":"kafka.server.ReplicaFetcherManager","message":"[ReplicaFetcherManager on broker 5] Removed fetcher for partitions Set(maxwell.transactions-22)","endOfBatch":false,"loggerFqcn":"org.slf4j.impl.Log4jLoggerAdapter","threadId":72,"threadPriority":5}
{"timeMillis":1568087844880,"thread":"kafka-request-handler-1","level":"INFO","loggerName":"kafka.cluster.Partition","message":"[Partition maxwell.transactions-22 broker=5] maxwell.transactions-22 starts at Leader Epoch 271 from offset 275403423. Previous Leader Epoch was: 270","endOfBatch":false,"loggerFqcn":"org.slf4j.impl.Log4jLoggerAdapter","threadId":72,"threadPriority":5}
{"timeMillis":1568087844891,"thread":"kafka-request-handler-1","level":"INFO","loggerName":"state.change.logger","message":"[Broker id=5] Skipped the become-leader state change after marking its partition as leader with correlation id 15945 from controller 4 epoch 155 for partition maxwell.transactions-22 (last update controller epoch 155) since it is already the leader for the partition.","endOfBatch":false,"loggerFqcn":"org.slf4j.impl.Log4jLoggerAdapter","threadId":72,"threadPriority":5}
{code}
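
For anyone comparing against their own broker logs, here is a rough sketch of pulling the leader epoch transitions for one partition out of JSON-formatted log lines like those above. It assumes one JSON object per line with a "message" field, as in my logging setup; the function name leader_epoch_starts is hypothetical.

```python
import json
import re

# Matches the kafka.cluster.Partition message quoted above.
# Assumption: the wording is that of broker 2.1.1 and may differ elsewhere.
EPOCH_START = re.compile(
    r"\[Partition (?P<partition>\S+) broker=(?P<broker>\d+)\] \S+ starts at "
    r"Leader Epoch (?P<epoch>\d+) from offset (?P<offset>\d+)\. "
    r"Previous Leader Epoch was: (?P<previous>-?\d+)"
)


def leader_epoch_starts(json_lines, partition):
    """Collect leader epoch start events for one partition, in log order."""
    events = []
    for raw in json_lines:
        try:
            record = json.loads(raw)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        match = EPOCH_START.search(record.get("message", ""))
        if match and match.group("partition") == partition:
            events.append({
                "time_millis": record.get("timeMillis"),
                "broker": int(match.group("broker")),
                "epoch": int(match.group("epoch")),
                "start_offset": int(match.group("offset")),
                "previous_epoch": int(match.group("previous")),
            })
    return events
```

Applied to the four lines above, only the kafka.cluster.Partition entry matches: epoch 271 starting at offset 275403423 on broker 5. The consumer's committed offset of 275413397 lies after that start offset, so the epoch reported by the consumer is consistent with what the broker logged.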

As soon as I restart the broker which is the leader for that partition, the messages flow through to the consumer.

Given that restarting the consumer doesn't help, but restarting the broker allows the stalled partition to resume, I'm inclined to think this is an issue with the broker.  Please let me know if I can assist further with investigating or resolving this.

