kafka-dev mailing list archives

From "Aravind Velamur Srinivasan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (KAFKA-7865) Kafka Constant Consumer Errors for ~30 min after Network Blip
Date Thu, 24 Jan 2019 00:41:00 GMT
Aravind Velamur Srinivasan created KAFKA-7865:
-------------------------------------------------

             Summary: Kafka Constant Consumer Errors for ~30 min after Network Blip
                 Key: KAFKA-7865
                 URL: https://issues.apache.org/jira/browse/KAFKA-7865
             Project: Kafka
          Issue Type: Bug
          Components: consumer
    Affects Versions: 0.10.2.1
            Reporter: Aravind Velamur Srinivasan


We are running Kafka v0.10.2.1 on AWS, backed by EBS, with 10 brokers (and 5 ZooKeepers). A few
days ago we had a network blip for ~30-45 seconds. The interesting part was that the consumers
coordinated by one of the brokers all kept getting error code 16 (NOT_COORDINATOR) for ~30-35
minutes before eventually receiving messages successfully.
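For context, here is a minimal sketch (broker, topic, and group names are placeholders, not our real setup) of the kind of plain Java consumer loop that was seeing the errors. The 0.10.2 client maps error code 16 to NOT_COORDINATOR_FOR_GROUP and is supposed to rediscover the group coordinator and retry internally, so this just makes any surfaced errors visible:
{code:java}
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.RetriableException;

public class BlipObserver {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder
        props.put("group.id", "example-group");           // placeholder
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic")); // placeholder topic
            while (true) {
                try {
                    // 0.10.2 signature: poll(long timeoutMs)
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    consumer.commitSync();
                    System.out.println("polled " + records.count() + " records");
                } catch (RetriableException e) {
                    // Coordinator errors such as NOT_COORDINATOR_FOR_GROUP are normally
                    // retried inside the client; log any that are surfaced and keep polling.
                    System.err.println("retriable error, polling again: " + e);
                }
            }
        }
    }
}
{code}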

The broker itself was up and running and its resource utilization (CPU, memory, disk, etc.)
was fine. The under-replicated partitions and other metrics recovered within a minute, and all
the consumer groups coordinated by other brokers were fine as well. The broker logged errors
during the blip, but only during the blip, like the one below; other brokers saw the same
errors and recovered in about a minute:
{noformat}
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader
for that topic-partition. (kafka.server.ReplicaFetcherThread)
{noformat}
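To sanity-check whether clients were simply holding stale metadata, something like the following (again with placeholder broker and topic names) could dump what a plain Java client currently believes about leadership and ISR for a topic, using the consumer's partitionsFor() call:
{code:java}
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;

public class MetadataDump {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // Print the leader and in-sync replica count the client sees per partition.
            for (PartitionInfo p : consumer.partitionsFor("example-topic")) {  // placeholder topic
                System.out.printf("partition=%d leader=%s isr=%d/%d%n",
                        p.partition(), p.leader(),
                        p.inSyncReplicas().length, p.replicas().length);
            }
        }
    }
}
{code}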

Eventually, after ~30 minutes, it recovered, but for a real-time messaging bus, 30 minutes is
not so real-time :)

Some of the questions we have are:
1. Why was this the only broker affected? Note: it was not the controller, and it did not see
any more network issues than the others.
2. What made it recover? We did not change or restart anything.
3. Why did the client retries never work? The client was retrying constantly and kept getting
the same error.
4. Why didn't we notice any error logs either?
5. Is this a known issue that has been fixed in a later release?
6. What can we do to mitigate this? (See the config sketch after this list.)
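On question 6, here is a hedged sketch of the consumer-side knobs we are considering; these are standard 0.10.2 consumer configs, but the values are illustrative, not tested recommendations:
{code:java}
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");   // placeholder
props.put("group.id", "example-group");           // placeholder
// Refresh cluster metadata at least once a minute instead of the 5-minute default,
// so a stale view of leaders/coordinator is dropped sooner.
props.put("metadata.max.age.ms", "60000");
// Back-off between retries of failed requests and between reconnect attempts.
props.put("retry.backoff.ms", "200");
props.put("reconnect.backoff.ms", "500");
{code}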

Are we running into something like this?
{noformat}
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader
for that topic-partition. (kafka.server.ReplicaFetcherThread)
{noformat}

Note: some of the other settings we have:
{noformat}
zookeeper.connection.timeout.ms=10000   // server.properties
zookeeper.connection.timeout.ms=6000    // consumer.properties
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
