kafka-dev mailing list archives

From "Boyang Chen (JIRA)" <j...@apache.org>
Subject [jira] [Created] (KAFKA-8626) Group will fall into constant incremental rebalancing with a long non-responsive static member
Date Wed, 03 Jul 2019 16:25:00 GMT
Boyang Chen created KAFKA-8626:

             Summary: Group will fall into constant incremental rebalancing with a long non-responsive static member
                 Key: KAFKA-8626
                 URL: https://issues.apache.org/jira/browse/KAFKA-8626
             Project: Kafka
          Issue Type: Bug
            Reporter: Boyang Chen
            Assignee: Boyang Chen

Currently, when a group rebalances, static members have until the rebalance timeout expires
to rejoin. If they do not rejoin in time, the coordinator rejoins them virtually; that is,
the coordinator simply reuses their old subscription. This behavior may be a problem for
cooperative reassignment, because the old subscription may contain a set of owned partitions.
The assignor will respect the owned set of partitions, but that won't stop it from trying to
move them to another consumer. In this case, we will set the NEED_REJOIN error code. The idea
is that consumers observe this error, revoke any needed partitions, and immediately rejoin.
But if the static member just continues using its old subscription, we'll be stuck in the
rebalance state until the static member comes back online, because the non-responsive static
member won't give up the partitions listed in its old subscription.
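
The loop described above can be illustrated with a small simulation. This is only a sketch
under stated assumptions, not Kafka code: the member names, the round-robin target, and the
functions target_assignment, cooperative_round, and simulate are all hypothetical. It assumes
that a partition which must move is withheld from its new owner until the old owner stops
reporting it as owned, and that the coordinator reuses the old owned set for an offline
static member.

```python
from typing import Dict, List, Set, Tuple


def target_assignment(members: List[str], partitions: List[int]) -> Dict[str, Set[int]]:
    """Hypothetical balanced target: round-robin over sorted member ids."""
    ordered = sorted(members)
    target: Dict[str, Set[int]] = {m: set() for m in ordered}
    for i, p in enumerate(sorted(partitions)):
        target[ordered[i % len(ordered)]].add(p)
    return target


def cooperative_round(
    owned: Dict[str, Set[int]], partitions: List[int]
) -> Tuple[Dict[str, Set[int]], bool]:
    """Run one rebalance round and return (assignment, rejoin_needed)."""
    target = target_assignment(list(owned), partitions)
    owner_of = {p: m for m, ps in owned.items() for p in ps}
    assignment: Dict[str, Set[int]] = {m: set() for m in owned}
    rejoin_needed = False
    for member, wanted in target.items():
        for p in wanted:
            previous = owner_of.get(p)
            if previous is None or previous == member:
                assignment[member].add(p)
            else:
                # Still owned elsewhere: withhold it and ask the group to rejoin.
                rejoin_needed = True
    return assignment, rejoin_needed


def simulate(rounds: int, offline: Set[str]) -> List[bool]:
    """Return the rejoin_needed flag observed in each round."""
    partitions = [0, 1, 2, 3]
    owned: Dict[str, Set[int]] = {"static-a": {0, 1, 2, 3}, "b": set()}
    history: List[bool] = []
    for _ in range(rounds):
        assignment, rejoin_needed = cooperative_round(owned, partitions)
        history.append(rejoin_needed)
        if not rejoin_needed:
            break
        for member in owned:
            if member in offline:
                continue  # assumption: coordinator reuses the old owned set
            owned[member] = assignment[member]  # active member revokes and rejoins
    return history
```

With every member responsive, simulate(5, set()) settles after one extra round. With
"static-a" offline, each of the five rounds still asks for a rejoin, which is the constant
rebalancing this ticket describes.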

Some ideas proposed by Jason:

1. Make revocation optional. Basically, get rid of the internal REJOIN_NEEDED error code. Consumers
only rebalance if they revoke partitions themselves or detect the group rebalancing. In this
case, the static member would just decline to give up its partitions until it is back online.
2. Make the assignor aware of which members are active in the current rebalance. If a static
member is not active, then the assignor can simply not reassign any of its owned partitions.
It might be a good idea to have this anyway, because rebalances are often used as a (clumsy)
way to collect information from the group members. For example, when Connect rebalances a
group, it is looking for consistency among the members on the config offset that they have read.
If one member is just reporting old state, then this protocol won't work.
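
Idea 2 could look roughly like the sketch below. It is an illustration under assumptions, not
a proposed implementation: the function assign_with_activity and its parameters are
hypothetical, and the round-robin over the remaining partitions stands in for whatever the
real assignor would do. It assumes the assignor is told which member ids actually rejoined in
the current rebalance.

```python
from typing import Dict, List, Set


def assign_with_activity(
    owned: Dict[str, Set[int]], active: Set[str], partitions: List[int]
) -> Dict[str, Set[int]]:
    """Leave partitions of inactive members in place; spread the rest over active ones."""
    assignment: Dict[str, Set[int]] = {m: set() for m in owned}
    pinned: Set[int] = set()
    for member, parts in owned.items():
        if member not in active:
            # An inactive static member keeps exactly what it reported as owned.
            assignment[member] = set(parts)
            pinned |= parts
    free = sorted(p for p in partitions if p not in pinned)
    ordered = sorted(m for m in owned if m in active)
    if ordered:
        for i, p in enumerate(free):
            assignment[ordered[i % len(ordered)]].add(p)
    return assignment
```

Because nothing owned by an inactive member is ever moved, no revocation is requested from a
member that cannot respond, so the group can settle without waiting for it.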

