kafka-jira mailing list archives

From "Jason Gustafson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-6671) Consumer group coordinator releases group before new coordinator is ready.
Date Fri, 16 Mar 2018 18:40:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16402337#comment-16402337 ]

Jason Gustafson commented on KAFKA-6671:
----------------------------------------

One cause of slow coordinator failover is an oversized __consumer_offsets topic. Can you
verify the size of the __consumer_offsets partitions and whether the log cleaner is enabled?
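
For anyone following along, here is a rough sketch of one way to do that check directly on a
broker host. It is only an illustration, not a Kafka tool: the function names are made up for
this example, it assumes the usual on-disk layout of one `<topic>-<partition>` directory per
partition under each `log.dirs` entry, it assumes the group's partition is chosen as
abs(groupId.hashCode) modulo `offsets.topic.num.partitions` (50 unless overridden), and it
treats `log.cleaner.enable` as true when the property is absent. Please check these
assumptions against your broker version.

```python
import os


def partition_sizes(log_dir, topic="__consumer_offsets"):
    """Return {partition number: bytes on disk} for one broker log directory."""
    sizes = {}
    prefix = topic + "-"
    for entry in os.listdir(log_dir):
        path = os.path.join(log_dir, entry)
        suffix = entry[len(prefix):]
        # Keep only directories named exactly "<topic>-<digits>".
        if not (entry.startswith(prefix) and suffix.isdigit() and os.path.isdir(path)):
            continue
        sizes[int(suffix)] = sum(
            os.path.getsize(os.path.join(path, name))
            for name in os.listdir(path)
            if os.path.isfile(os.path.join(path, name))
        )
    return sizes


def log_cleaner_enabled(properties_path, default=True):
    """Read log.cleaner.enable from a broker properties file.

    `default` is an assumption about the broker default when the key is absent.
    """
    value = None
    with open(properties_path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith(("#", "!")):
                continue  # skip blank lines and comments
            key, sep, val = line.partition("=")
            if sep and key.strip() == "log.cleaner.enable":
                value = val.strip().lower()  # last occurrence wins
    return default if value is None else value == "true"


def java_string_hash(s):
    """Reproduce Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    data = s.encode("utf-16-be")  # Java hashes UTF-16 code units
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF
    return h - (1 << 32) if h & 0x80000000 else h


def partition_for(group_id, num_partitions=50):
    """Guess which __consumer_offsets partition holds a group (assumed formula)."""
    h = java_string_hash(group_id)
    # Integer.MIN_VALUE has no positive counterpart; it is mapped to 0 here.
    return 0 if h == -(1 << 31) else abs(h) % num_partitions


# Example usage (paths are hypothetical):
#   sizes = partition_sizes("/var/kafka-logs")
#   p = partition_for("ConsumerGroup")
#   print(p, sizes.get(p), log_cleaner_enabled("/etc/kafka/server.properties"))
```

Note that the properties check only shows how the cleaner is configured. The cleaner thread
can also stop at runtime, which this sketch would not detect, so a partition that is far
larger than its neighbours is worth a look at the broker's log cleaner log as well.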

> Consumer group coordinator releases group before new coordinator is ready.
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-6671
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6671
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.10.2.1
>            Reporter: Rob Gevers
>            Priority: Major
>
> We regularly have an issue with our Kafka deploys which causes consumers to be unable
to consume for an extended period of time (up to an hour) after the deploy finishes. The issue
appears to be a side-effect of the way consumer group coordination is managed between nodes.
A sample timeline of a deploy looks like the following:
> We initiate a clean shutdown of a node (which we will call kafka-2). We see these traces:
> {noformat}
>  [2018-02-20 09:13:46,935] INFO [GroupCoordinator 1]: Loading group metadata for ConsumerGroup
with generation 3041 (kafka.coordinator.GroupCoordinator){noformat}
> {noformat}
>  [2018-02-20 09:13:47,788] INFO [GroupCoordinator 2]: Unloading group metadata for ConsumerGroup
with generation 3041{noformat}
> At this point kafka-2 is shut down and restarted successfully. Consumers continue to function
fine. Once kafka-2 is back online we see this trace from kafka-1:
> {noformat}
>  [2018-02-20 09:49:30,486] INFO [GroupCoordinator 1]: Unloading group metadata for ConsumerGroup
with generation 3041{noformat}
> At this point the consumers go into a loop of "Discovered coordinator Kafka-2" followed by "Marking
the coordinator Kafka-2 dead". This preempts the heartbeat timer, and we even see the heartbeat
rate metrics drop to 0. This continues until kafka-2 has finished processing offset data and
finally logs:
> {noformat}
>  [2018-02-20 10:52:28,956] INFO [GroupCoordinator 2]: Loading group metadata for ConsumerGroup
with generation 3041 (kafka.coordinator.GroupCoordinator){noformat}
> What seems like a bug to me is that kafka-1 is unloading the consumer group long before
kafka-2 is ready to load it. This seems to leave the group in an unusable state, with offset
commits failing because they are trying to commit to kafka-2, but kafka-2 keeps responding
that it isn't the group coordinator. There is no coordinator for an hour.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
