Date: Thu, 12 Sep 2019 04:24:00 +0000 (UTC)
From: "Luke Stephenson (Jira)"
To: dev@kafka.apache.org
Subject: [jira] [Created] (KAFKA-8900) Stalled partitions

Luke Stephenson created KAFKA-8900:
--------------------------------------

             Summary: Stalled partitions
                 Key: KAFKA-8900
                 URL: https://issues.apache.org/jira/browse/KAFKA-8900
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 2.1.1
            Reporter: Luke Stephenson

I'm seeing behaviour
where a Scala KafkaConsumer has stalled on one partition of a topic. All other partitions of that topic are being consumed successfully. Restarting the consumer process does not resolve the issue.

The consumer is using version 2.3.0 ("org.apache.kafka" % "kafka-clients" % "2.3.0").

When the consumer starts, I see that it is assigned the partition. However, it then logs:
{code}
[Consumer clientId=kafka-bus-router-64c88855cf-hxck7.event-bus-router-consumer.1d1ed7ee-5038-4441-84eb-8080ac130e9a, groupId=event-bus-router] Setting offset for partition maxwell.transactions-22 to the committed offset FetchPosition{offset=275413397, offsetEpoch=Optional[271], currentLeader=LeaderAndEpoch{leader=:-1 (id: -1 rack: null), epoch=271}}
{code}

Note that the leader is logged as "-1". If I search through my application logs for the past couple of days, the only time I ever see this logged on the consumer is for this partition.

The Kafka broker is running version 2.1.1. On the broker side the logs show:
{code}
{"timeMillis":1568087844876,"thread":"kafka-request-handler-1","level":"WARN","loggerName":"state.change.logger","message":"[Broker id=5] Ignoring LeaderAndIsr request from controller 4 with correlation id 15943 epoch 155 for partition maxwell.transactions-22 since its associated leader epoch 270 is not higher than the current leader epoch 270","endOfBatch":false,"loggerFqcn":"org.slf4j.impl.Log4jLoggerAdapter","threadId":72,"threadPriority":5}
{"timeMillis":1568087844880,"thread":"kafka-request-handler-1","level":"INFO","loggerName":"kafka.server.ReplicaFetcherManager","message":"[ReplicaFetcherManager on broker 5] Removed fetcher for partitions Set(maxwell.transactions-22)","endOfBatch":false,"loggerFqcn":"org.slf4j.impl.Log4jLoggerAdapter","threadId":72,"threadPriority":5}
{"timeMillis":1568087844880,"thread":"kafka-request-handler-1","level":"INFO","loggerName":"kafka.cluster.Partition","message":"[Partition maxwell.transactions-22 broker=5] maxwell.transactions-22 starts at Leader Epoch 271 from offset 275403423. Previous Leader Epoch was: 270","endOfBatch":false,"loggerFqcn":"org.slf4j.impl.Log4jLoggerAdapter","threadId":72,"threadPriority":5}
{"timeMillis":1568087844891,"thread":"kafka-request-handler-1","level":"INFO","loggerName":"state.change.logger","message":"[Broker id=5] Skipped the become-leader state change after marking its partition as leader with correlation id 15945 from controller 4 epoch 155 for partition maxwell.transactions-22 (last update controller epoch 155) since it is already the leader for the partition.","endOfBatch":false,"loggerFqcn":"org.slf4j.impl.Log4jLoggerAdapter","threadId":72,"threadPriority":5}
{code}

As soon as I restart the broker that is the leader for that partition, the messages flow through to the consumer.

Given that restarting the consumer doesn't help, but restarting the broker allows the stalled partition to resume, I'm inclined to think this is an issue with the broker.

Please let me know if I can assist further with investigating or resolving this.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
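
For anyone trying to check whether other partitions show the same symptom, the following is a minimal sketch of how consumer logs could be scanned for a "Setting offset for partition" line whose leader id is negative. It is only an illustration: the regular expression is derived solely from the single consumer log line quoted in this report, so other client versions may format the line differently, and the names `FETCH_POSITION`, `FetchPosition`, `parse_fetch_position` and `has_unknown_leader` are hypothetical helpers, not part of Kafka.

```python
import re
from typing import NamedTuple, Optional

# Assumption: this pattern mirrors only the one consumer log line quoted in
# the report (kafka-clients 2.3.0); it is not an official log format.
FETCH_POSITION = re.compile(
    r"Setting offset for partition (?P<partition>\S+) to the committed offset "
    r"FetchPosition\{offset=(?P<offset>\d+), "
    r"offsetEpoch=(?P<offset_epoch>\S+), "
    r"currentLeader=LeaderAndEpoch\{leader=\S*:-?\d+ "
    r"\(id: (?P<leader_id>-?\d+) rack: \S+\), "
    r"epoch=(?P<leader_epoch>[^}]+)\}\}"
)


class FetchPosition(NamedTuple):
    """Fields pulled out of one 'Setting offset for partition' log line."""
    partition: str
    offset: int
    offset_epoch: str   # kept as text, e.g. "Optional[271]"
    leader_id: int
    leader_epoch: str   # kept as text, since it may not always be numeric


def parse_fetch_position(line: str) -> Optional[FetchPosition]:
    """Return the parsed fields, or None if the line does not match."""
    m = FETCH_POSITION.search(line)
    if m is None:
        return None
    return FetchPosition(
        partition=m["partition"],
        offset=int(m["offset"]),
        offset_epoch=m["offset_epoch"],
        leader_id=int(m["leader_id"]),
        leader_epoch=m["leader_epoch"],
    )


def has_unknown_leader(pos: FetchPosition) -> bool:
    """True when the logged leader id is negative, as in the stalled partition."""
    return pos.leader_id < 0


# The consumer log line from the report, with the clientId shortened.
SAMPLE_LINE = (
    "[Consumer clientId=example, groupId=event-bus-router] "
    "Setting offset for partition maxwell.transactions-22 to the committed offset "
    "FetchPosition{offset=275413397, offsetEpoch=Optional[271], "
    "currentLeader=LeaderAndEpoch{leader=:-1 (id: -1 rack: null), epoch=271}}"
)

if __name__ == "__main__":
    position = parse_fetch_position(SAMPLE_LINE)
    if position is not None and has_unknown_leader(position):
        print(f"{position.partition}: leader unknown at offset {position.offset}")
```

Feeding each consumer log line through `parse_fetch_position` and keeping the results for which `has_unknown_leader` returns true would list the partitions that logged a leader of "-1".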