kafka-jira mailing list archives

From "Guozhang Wang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup
Date Tue, 18 Jul 2017 23:18:01 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guozhang Wang updated KAFKA-5152:
---------------------------------
    Description: 
If streams shutdown is initiated during state restore (e.g. an uncaught exception is thrown)
streams will not shut down until all stores are first finished restoring.

As restore progresses, stream threads appear to be taken out of service as part of the shutdown
sequence, causing rebalancing of tasks. This compounds the problem by slowing down the restore
process even further, since the remaining threads now have to also restore the reassigned
tasks before they can shut down.

A more severe issue arises if a new rebalance is triggered near the end of the waitingSync
phase (e.g. because a new member joins the group, or some members time out waiting for the
SyncGroup response). Some consumer clients of the group may already have proceeded to
{{onPartitionsAssigned}} and be blocked trying to grab a file dir lock that other clients have
not yet released. Meanwhile, the clients holding the lock keep re-sending {{JoinGroup}}
requests, but the rebalance cannot complete: the clients blocked on the file dir lock are
never kicked out of the group, because their heartbeat threads keep sending HBRequest.
Hence this is a deadlock caused by not releasing the file dir locks in task suspension.

  was:
If streams shutdown is initiated during state restore (e.g. an uncaught exception is thrown)
streams will not shut down until all stores are first finished restoring.

As restore progresses, stream threads appear to be taken out of service as part of the shutdown
sequence, causing rebalancing of tasks. This compounds the problem by slowing down the restore
process even further, since the remaining threads now have to also restore the reassigned
tasks before they can shut down.


> Kafka Streams keeps restoring state after shutdown is initiated during startup
> ------------------------------------------------------------------------------
>
>                 Key: KAFKA-5152
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5152
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.1
>            Reporter: Xavier Léauté
>            Assignee: Matthias J. Sax
>             Fix For: 0.10.2.2, 0.11.0.1
>
>
> If streams shutdown is initiated during state restore (e.g. an uncaught exception is
> thrown) streams will not shut down until all stores are first finished restoring.
> As restore progresses, stream threads appear to be taken out of service as part of the
> shutdown sequence, causing rebalancing of tasks. This compounds the problem by slowing down
> the restore process even further, since the remaining threads now have to also restore the
> reassigned tasks before they can shut down.
> A more severe issue arises if a new rebalance is triggered near the end of the
> waitingSync phase (e.g. because a new member joins the group, or some members time out
> waiting for the SyncGroup response). Some consumer clients of the group may already have
> proceeded to {{onPartitionsAssigned}} and be blocked trying to grab a file dir lock that
> other clients have not yet released. Meanwhile, the clients holding the lock keep re-sending
> {{JoinGroup}} requests, but the rebalance cannot complete: the clients blocked on the file
> dir lock are never kicked out of the group, because their heartbeat threads keep sending
> HBRequest. Hence this is a deadlock caused by not releasing the file dir locks in task
> suspension.
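
The deadlock described above comes down to a per-task file dir lock that a suspended task
keeps holding. The following is a minimal sketch of that idea, not the actual Kafka Streams
code: the class StateDirLockSketch, its methods, and the task and client ids are all
hypothetical, and an in-memory map stands in for the on-disk lock files.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StateDirLockSketch {
    // Hypothetical stand-in for per-task file dir locks: task id -> owning client id.
    private final Map<String, String> owners = new ConcurrentHashMap<>();

    /** Non-blocking attempt; true if the client now owns (or already owned) the lock. */
    public boolean tryLock(String taskId, String clientId) {
        String previousOwner = owners.putIfAbsent(taskId, clientId);
        return previousOwner == null || previousOwner.equals(clientId);
    }

    /** Releases the lock only if this client is the current owner. */
    public void unlock(String taskId, String clientId) {
        owners.remove(taskId, clientId);
    }

    /**
     * Assumed shape of a suspension step that also gives up the file dir lock,
     * which is the release the description says was missing.
     */
    public void suspendTask(String taskId, String clientId) {
        // Flushing and closing state stores would happen here (omitted in this sketch).
        unlock(taskId, clientId);
    }

    public static void main(String[] args) {
        StateDirLockSketch locks = new StateDirLockSketch();

        // client-A owns task 0_0 before the rebalance.
        locks.tryLock("0_0", "client-A");

        // client-B has been assigned 0_0, but client-A never released the lock.
        System.out.println(locks.tryLock("0_0", "client-B"));

        // Once client-A releases the lock during suspension, client-B can proceed.
        locks.suspendTask("0_0", "client-A");
        System.out.println(locks.tryLock("0_0", "client-B"));
    }
}
```

Running main prints false and then true: the newly assigned client cannot take the lock
while the previous owner still holds it across the rebalance, and succeeds once the lock is
released in suspension.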



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
