kafka-jira mailing list archives

From "Jason Gustafson (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (KAFKA-5586) Handle client disconnects during JoinGroup
Date Wed, 19 Jul 2017 23:42:01 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-5586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093969#comment-16093969 ]

Jason Gustafson edited comment on KAFKA-5586 at 7/19/17 11:41 PM:
------------------------------------------------------------------

For a bit more background, this problem came up in the context of long Kafka Streams rebalances.
Kafka Streams would like to use an effectively infinite {{max.poll.interval.ms}} (which is
used to derive the rebalance timeout), but technically this also requires setting a large
{{request.timeout.ms}}. So instead they use a normal request timeout and depend on being able
to retry the JoinGroup. On further thought, the case for existing members is probably
already handled adequately, since in the common case (clean consumer shutdown) we will send
the LeaveGroup to remove the member from the group (even if it is rebalancing). Only
members joining for the first time are problematic, since LeaveGroup does not help us there.
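
The interaction between the two timeouts can be illustrated with a small Python sketch. It is a simplified model rather than client or broker code: it assumes the coordinator answers the JoinGroup only when the join phase ends, and that the client re-sends immediately each time its request timeout elapses. The function name and parameters are hypothetical.

```python
import math


def join_group_sends(join_phase_ms: int, request_timeout_ms: int) -> int:
    """Number of JoinGroup sends needed before a response can arrive.

    Simplified model (an assumption, not broker code): the coordinator
    answers only once the join phase ends after join_phase_ms, and the
    client re-sends immediately each time request_timeout_ms elapses.
    """
    if request_timeout_ms <= 0:
        raise ValueError("request_timeout_ms must be positive")
    # At least one send is always needed, even if the join phase is instant.
    return max(1, math.ceil(join_phase_ms / request_timeout_ms))
```

Under this model, a 300000 ms join phase with a 30000 ms request timeout means the client sends the JoinGroup ten times, which is why the retry path has to be harmless for the approach described above to work.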


was (Author: hachikuji):
For a bit more background, this problem came up in the context of long Kafka Streams rebalances.
Kafka Streams would like to use an effectively infinite {{max.poll.interval.ms}} (which is
used to derive the rebalance timeout), but technically this requires also setting a large
{{request.timeout.ms}}. So instead they use a normal request timeout and depend on being able
to retry the JoinGroup. Thinking a little more, the case for existing members is probably
already handled adequately since in the common case (clean consumer shutdown), we will send
the LeaveGroup. It is only members joining for the first time that is problematic since LeaveGroup
does not help us there.

> Handle client disconnects during JoinGroup
> ------------------------------------------
>
>                 Key: KAFKA-5586
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5586
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>
> If a consumer disconnects with a JoinGroup in flight, we do not remove it from the group
until after the Join phase completes. If the client immediately re-sends the JoinGroup request
and it already had a memberId, then the callback will be replaced and there is no harm done.
For the other cases:
> 1. If the client disconnected due to a failure and does not re-send the JoinGroup, the
consumer will still be included in the new group generation after the rebalance completes,
but will immediately time out and trigger a new rebalance.
> 2. If the consumer was not a member of the group and re-sends JoinGroup, then a new memberId
will be created for that consumer and the old one will not be removed. When the rebalance
completes, the old memberId will time out and a rebalance will be triggered.
> To address these issues, we should add some additional logic to handle client disconnections
during the join phase. For newly generated memberIds, we should simply remove them. For existing
members, we should probably leave them in the group and reset the heartbeat expiration task.
> Note that we currently have no facility to expose disconnects from the network layer
to the other layers, so we need to find a good approach for this.
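
The proposed handling can be sketched in Python as a toy model. This is not the coordinator's actual code: the `Group` and `Member` classes, the `is_new` flag, and the `on_disconnect_during_join` hook are all hypothetical, and the sketch simply assumes that the network layer can somehow report the disconnect, which the description notes is not yet possible.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    member_id: str
    is_new: bool                 # memberId was generated during this join phase
    heartbeat_deadline_ms: int   # when the member would be expired


@dataclass
class Group:
    session_timeout_ms: int
    members: dict = field(default_factory=dict)

    def add(self, member_id: str, is_new: bool, now_ms: int) -> None:
        deadline = now_ms + self.session_timeout_ms
        self.members[member_id] = Member(member_id, is_new, deadline)

    def on_disconnect_during_join(self, member_id: str, now_ms: int) -> None:
        """Hypothetical hook invoked when a client with a pending JoinGroup disconnects."""
        member = self.members.get(member_id)
        if member is None:
            return
        if member.is_new:
            # Newly generated memberId: the client cannot reuse it, so drop it.
            del self.members[member_id]
        else:
            # Existing member: keep it and restart its heartbeat expiration.
            member.heartbeat_deadline_ms = now_ms + self.session_timeout_ms
```

In this model, a disconnect at time 5000 with a 10000 ms session timeout removes a newly created member outright, while an existing member stays in the group with its expiration moved to 15000.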



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
