kafka-dev mailing list archives

From "Rajini Sivaram (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (KAFKA-5395) Distributed Herder Deadlocks on Shutdown
Date Wed, 07 Jun 2017 08:23:18 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram reassigned KAFKA-5395:
-------------------------------------

    Assignee: Rajini Sivaram

> Distributed Herder Deadlocks on Shutdown
> ----------------------------------------
>
>                 Key: KAFKA-5395
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5395
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 0.10.2.1
>            Reporter: Michael Jaschob
>            Assignee: Rajini Sivaram
>            Priority: Critical
>             Fix For: 0.11.0.0
>
>         Attachments: connect_01021_shutdown_deadlock.txt
>
>
> We're trying to upgrade Kafka Connect to 0.10.2.1 and see that the process does not shut down cleanly. It hangs instead. From what I can tell [KAFKA-4786|https://github.com/apache/kafka/commit/ba4eafa7874988374abcd9f48fbab96abb2032a4] introduced this deadlock.
> [close|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L664] on the AbstractCoordinator is marked as synchronized and acquires the coordinator's monitor. The first thing it tries to do is [join|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L323] the heartbeat thread.
> Meanwhile, the heartbeat thread is [synchronized on the same monitor|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L891], which it relinquishes when it [waits|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L926]. But for the wait to return (and the run method of the heartbeat to terminate) it needs to reacquire that monitor.
> There's no way for the heartbeat thread to reacquire the monitor since it is held by the distributed herder thread. And the distributed herder will never relinquish the monitor since it is waiting for the heartbeat thread to join.
> I am attaching a thread dump illustrating the situation. Take note in particular of threads #178 (the heartbeat thread) and #159 (the herder thread). The former is BLOCKED trying to reacquire 0x00000007406cc0c0, and the latter is WAITING on the heartbeat thread to join, having itself acquired 0x00000007406cc0c0.
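
The pattern described above can be reproduced outside Kafka. The following is a minimal sketch, not Kafka code: the class MonitorJoinDeadlockSketch and all of its members are hypothetical names chosen for illustration, and a plain Object stands in for the coordinator's monitor. It uses a timed join so that the demonstration terminates; with an untimed join() the first variant would hang as in the report. The second variant (release the monitor, then join) is only one conventional way to avoid the pattern and is not claimed to be the fix applied for KAFKA-5395.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for the coordinator and its heartbeat thread; not Kafka code.
class MonitorJoinDeadlockSketch {
    private final Object monitor = new Object();              // plays the coordinator's monitor
    private final CountDownLatch parked = new CountDownLatch(1);
    private boolean closed = false;                           // guarded by monitor
    private final Thread heartbeat = new Thread(this::heartbeatLoop, "sketch-heartbeat");

    static MonitorJoinDeadlockSketch start() {
        MonitorJoinDeadlockSketch sketch = new MonitorJoinDeadlockSketch();
        sketch.heartbeat.setDaemon(true);
        sketch.heartbeat.start();
        return sketch;
    }

    private void heartbeatLoop() {
        synchronized (monitor) {
            while (!closed) {
                parked.countDown();
                try {
                    // wait() releases the monitor, but must reacquire it before returning.
                    monitor.wait(50);
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }

    // Problematic shape: join the heartbeat thread while still holding the monitor.
    // Returns true if the heartbeat thread was still alive when the timed join gave up.
    boolean closeWhileHoldingMonitor(long joinTimeoutMs) {
        awaitParked();
        synchronized (monitor) {
            closed = true;
            monitor.notifyAll();
            return !joinWithin(joinTimeoutMs);
        }
    }

    // Alternative shape: signal under the monitor, release it, then join.
    boolean closeThenJoin(long joinTimeoutMs) {
        awaitParked();
        synchronized (monitor) {
            closed = true;
            monitor.notifyAll();
        }
        return !joinWithin(joinTimeoutMs);
    }

    // Blocks until the heartbeat thread has entered its loop at least once.
    private void awaitParked() {
        try {
            parked.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the heartbeat thread terminated within the timeout.
    private boolean joinWithin(long timeoutMs) {
        try {
            heartbeat.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !heartbeat.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("join inside monitor, still alive: "
                + start().closeWhileHoldingMonitor(200));
        System.out.println("join outside monitor, still alive: "
                + start().closeThenJoin(2000));
    }
}
```

In the first variant the heartbeat thread is notified but cannot leave wait(), because the caller holds the monitor for the whole duration of the join; this matches the BLOCKED and WAITING states described for threads #178 and #159. In the second variant the monitor is free by the time join is called, so the heartbeat thread can reacquire it, observe the closed flag, and exit.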



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
