kafka-dev mailing list archives

From "Eno Thereska (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (KAFKA-5571) Possible deadlock during shutdown in setState in kafka streams 10.2
Date Sun, 20 Aug 2017 07:33:00 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eno Thereska resolved KAFKA-5571.
    Resolution: Fixed

> Possible deadlock during shutdown in setState in kafka streams 10.2
> -------------------------------------------------------------------
>                 Key: KAFKA-5571
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5571
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions:
>            Reporter: Greg Fodor
>            Assignee: Eno Thereska
>         Attachments: kafka-streams.deadlock.log
> I'm running a 10.2 job across 5 nodes with 32 stream threads on each node, and I find that
> when gracefully shutting down all of them at once via an Ansible script, some of the nodes
> end up freezing -- at a glance, the attached thread dump implies a deadlock between stream
> threads trying to update their state via setState. We haven't had this problem before, but
> it may or may not be related to changes in 10.2 (we are upgrading from 10.0 to 10.2).
> When we gracefully shut down all nodes simultaneously, what typically happens is that some
> subset of the nodes do not shut down completely but instead go through a rebalance first.
> It seems this deadlock requires the rebalance to occur simultaneously with the graceful
> shutdown; if we happen to shut them down and no rebalance happens, I don't believe the
> deadlock is triggered.
> The deadlock appears related to the state change handlers being subscribed across threads,
> and to the fact that StreamThread#setState and StreamStateListener#onChange are both
> synchronized.
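The cycle described above can be sketched outside of Kafka: thread A holds the "stream thread" monitor (setState) and needs the "listener" monitor (onChange), while thread B holds the listener monitor and needs the stream-thread monitor. This is a hypothetical, minimal sketch (not Kafka code); ReentrantLock and tryLock() stand in for the synchronized methods so the program can report the cycle instead of hanging:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlockSketch {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock threadLock = new ReentrantLock();   // stands in for StreamThread#setState's monitor
        ReentrantLock listenerLock = new ReentrantLock(); // stands in for StreamStateListener#onChange's monitor
        CountDownLatch aHolds = new CountDownLatch(1);
        CountDownLatch bHolds = new CountDownLatch(1);
        boolean[] acquired = new boolean[2];

        Thread a = new Thread(() -> {
            threadLock.lock();                // A enters "setState"
            aHolds.countDown();
            try {
                bHolds.await();               // wait until B holds the listener lock
                acquired[0] = listenerLock.tryLock(); // A tries to notify the listener
                if (acquired[0]) listenerLock.unlock();
            } catch (InterruptedException ignored) {
            } finally {
                threadLock.unlock();
            }
        });
        Thread b = new Thread(() -> {
            listenerLock.lock();              // B enters "onChange"
            bHolds.countDown();
            try {
                aHolds.await();               // wait until A holds the thread lock
                acquired[1] = threadLock.tryLock();   // B tries to call back into "setState"
                if (acquired[1]) threadLock.unlock();
            } catch (InterruptedException ignored) {
            } finally {
                listenerLock.unlock();
            }
        });
        a.start(); b.start();
        a.join(); b.join();
        // With blocking (synchronized) acquisition both threads would hang here;
        // tryLock() fails instead, exposing the lock cycle.
        System.out.println("A got listener lock: " + acquired[0]); // prints "A got listener lock: false"
        System.out.println("B got thread lock: " + acquired[1]);   // prints "B got thread lock: false"
    }
}
```

The latches guarantee each thread holds its first lock before the other attempts it, so both tryLock() calls fail deterministically; replacing them with blocking acquisition reproduces the hang.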
> Another thing worth mentioning is that one of the transformers used in the job has a
> close() method that can take 10-15 seconds to finish, since it needs to flush some data to
> a database. Having a long close() method, combined with a rebalance during a shutdown
> across many threads, may be necessary for reproduction.
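A common way to break this kind of cycle (a hypothetical sketch only, not the actual KAFKA-5571 fix) is to perform the state transition under the lock but invoke the listener after releasing it, so a slow or re-entrant onChange() can never participate in a lock cycle. All class and method names here are illustrative:

```java
public class NotifyOutsideLock {
    interface StateListener { void onChange(String oldState, String newState); }

    private final Object stateLock = new Object();
    private String state = "RUNNING";
    private volatile StateListener listener;

    void setListener(StateListener l) { listener = l; }

    void setState(String newState) {
        String oldState;
        synchronized (stateLock) {   // the state transition stays atomic...
            oldState = state;
            state = newState;
        }
        StateListener l = listener;  // ...but the callback runs lock-free, so a
        if (l != null) {             // long close()/flush inside the listener
            l.onChange(oldState, newState); // cannot block other state changes
        }
    }

    public static void main(String[] args) {
        NotifyOutsideLock t = new NotifyOutsideLock();
        t.setListener((from, to) -> System.out.println(from + " -> " + to));
        t.setState("PENDING_SHUTDOWN"); // prints "RUNNING -> PENDING_SHUTDOWN"
    }
}
```

The trade-off is that listeners may observe transitions slightly after they happen, which is usually acceptable for monitoring callbacks.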

This message was sent by Atlassian JIRA
