kafka-dev mailing list archives

From "Konstantine Karantasis (Jira)" <j...@apache.org>
Subject [jira] [Resolved] (KAFKA-9849) Fix issue with worker.unsync.backoff.ms creating zombie workers when incremental cooperative rebalancing is used
Date Wed, 10 Jun 2020 07:18:00 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantine Karantasis resolved KAFKA-9849.
-------------------------------------------
    Resolution: Fixed

> Fix issue with worker.unsync.backoff.ms creating zombie workers when incremental cooperative rebalancing is used
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-9849
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9849
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.3.1, 2.5.0, 2.4.1
>            Reporter: Konstantine Karantasis
>            Assignee: Konstantine Karantasis
>            Priority: Major
>             Fix For: 2.3.2, 2.6.0, 2.4.2, 2.5.1
>
>
> {{worker.unsync.backoff.ms}} is a property that was introduced a while ago when eager
(stop-the-world) rebalancing was the only option for Connect workers. The goal of this property
is to avoid triggering consecutive rebalances when a worker fails to catch up with the config
topic in time and therefore voluntarily leaves the group with a {{LeaveGroupRequest}}.
> With incremental cooperative rebalancing, this backoff ({{worker.unsync.backoff.ms}}), whose
default value equals the default value of {{scheduled.rebalance.max.delay.ms}} (5 min), might
turn a worker into a zombie worker that retains its tasks but stays out of the group. If the
worker that backs off does not return before {{scheduled.rebalance.max.delay.ms}} expires, the
leader of the group has no option but to reassign the missing tasks, which it considers lost,
to other members of the group.
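The timing argument above can be illustrated with a toy calculation. This is only a sketch and not Connect code: the function name zombie_window_ms and the rejoin_overhead_ms parameter are hypothetical, and the model assumes the leader's delay clock starts at the moment the worker leaves the group. Only the two property names and their 5 min defaults come from the description.

```python
# Defaults quoted in the issue description (5 min each).
DEFAULT_UNSYNC_BACKOFF_MS = 300_000       # worker.unsync.backoff.ms
DEFAULT_MAX_REBALANCE_DELAY_MS = 300_000  # scheduled.rebalance.max.delay.ms


def zombie_window_ms(backoff_ms, max_delay_ms, rejoin_overhead_ms=0):
    """Toy model (hypothetical helper, not part of Kafka Connect).

    The worker leaves the group at t=0 but keeps running its tasks.
    It is back in the group at t = backoff_ms + rejoin_overhead_ms.
    The leader reassigns the tasks it considers lost at t = max_delay_ms.
    The result is how long the worker keeps running tasks that have
    already been handed to other members (0 means it returned in time).
    """
    back_in_group_at = backoff_ms + rejoin_overhead_ms
    return max(0, back_in_group_at - max_delay_ms)


if __name__ == "__main__":
    # With equal defaults there is no slack: any rejoin overhead is zombie time.
    print(zombie_window_ms(DEFAULT_UNSYNC_BACKOFF_MS,
                           DEFAULT_MAX_REBALANCE_DELAY_MS,
                           rejoin_overhead_ms=2_000))
    # A backoff well below the delay leaves room to return in time.
    print(zombie_window_ms(60_000, DEFAULT_MAX_REBALANCE_DELAY_MS,
                           rejoin_overhead_ms=2_000))
```

Under these assumptions, the default configuration leaves the worker no margin at all, which is why the backoff can outlast the leader's patience.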
> Clearly, {{worker.unsync.backoff.ms}} was introduced to avoid rebalancing storms in the
presence of intermittent connectivity issues with eager rebalancing. However, when incremental
cooperative rebalancing is used, this property might inadvertently make workers operate as
zombie workers that keep running tasks while they are out of the group.
> Of course, a good tradeoff is needed: the protocol should not become too eager again, yet
workers should not turn into zombies when the connection to the broker coordinator is lost
only for a short time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
