kafka-jira mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup
Date Wed, 02 Aug 2017 00:11:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110017#comment-16110017 ]

ASF GitHub Bot commented on KAFKA-5152:
---------------------------------------

GitHub user guozhangwang opened a pull request:

    https://github.com/apache/kafka/pull/3607

    [DO NOT MERGE] Existing StreamThread exception handling issues

    This is for @dguy as a reference while working on the first step of KAFKA-5152: a list
    of existing issues that need to be addressed at the stream thread layer.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/guozhangwang/kafka KMinor-stream-thread-exception-handling

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/3607.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3607
    
----
commit 2d45430191c3dc417992c08454f9c550c1e6bb93
Author: Guozhang Wang <wangguoz@gmail.com>
Date:   2017-07-25T22:43:18Z

    handle commit failed exception on stream thread

commit 9655791794dfa2623ba9f109676b112779fdceca
Author: Guozhang Wang <wangguoz@gmail.com>
Date:   2017-07-26T00:39:07Z

    minor fixes

commit 26226d61529007acc0ccc151e6f6675fc9757d34
Author: Guozhang Wang <wangguoz@gmail.com>
Date:   2017-07-26T01:01:57Z

    add a bunch of TODOs for exception handling

commit 3b054f556364be04d7f83a40b212e0c7facc4a23
Author: Guozhang Wang <wangguoz@gmail.com>
Date:   2017-07-27T22:09:43Z

    rebase from trunk

commit 4b0f4f9cb30537fd0b45b192e2a5d81005ffa3c5
Author: Guozhang Wang <wangguoz@gmail.com>
Date:   2017-07-27T22:26:46Z

    minor fixes

commit 5d2dffa72443139909d3e28f1684363a6e6f5585
Author: Guozhang Wang <wangguoz@gmail.com>
Date:   2017-07-27T22:29:39Z

    github comments

commit 41ba5721ec9fe88b91416621a6236794d37a74de
Author: Guozhang Wang <wangguoz@gmail.com>
Date:   2017-08-02T00:08:13Z

    rebase from trunk

----


> Kafka Streams keeps restoring state after shutdown is initiated during startup
> ------------------------------------------------------------------------------
>
>                 Key: KAFKA-5152
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5152
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.1
>            Reporter: Xavier Léauté
>            Assignee: Matthias J. Sax
>            Priority: Blocker
>             Fix For: 0.10.2.2, 0.11.0.1
>
>
> If streams shutdown is initiated during state restore (e.g. an uncaught exception is
> thrown), streams will not shut down until all stores have finished restoring.
> As restore progresses, stream threads appear to be taken out of service as part of the
> shutdown sequence, causing rebalancing of tasks. This compounds the problem by slowing down
> the restore process even further, since the remaining threads now also have to restore the
> reassigned tasks before they can shut down.
> A more severe issue arises if a new rebalance is triggered near the end of the
> waitingSync phase (e.g. because a new member joins the group, or some members time out on the
> SyncGroup response). Some consumer clients of the group may then already have proceeded to
> {{onPartitionsAssigned}} and be blocked trying to grab a file dir lock not yet released
> by other clients. Meanwhile, the clients holding the lock keep re-sending {{JoinGroup}}
> requests, but the rebalance cannot complete: the clients blocked on the file dir
> lock are not kicked out of the group, because their heartbeat threads keep sending
> HBRequest. Hence this is a deadlock caused by not releasing the file dir locks in task suspension.
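
The deadlock in the last paragraph of the quoted description comes down to lock ownership across a rebalance: a thread that keeps its file dir locks while rejoining the group blocks whichever thread is assigned the same task next. The following is a minimal sketch of the remedy the description implies, releasing the locks when tasks are suspended. It is a toy model, not Kafka Streams code: the class StateDirLockSketch, its methods, and the task and thread names are all hypothetical, and an in-memory map stands in for real file locks.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model (not the Kafka Streams implementation): one lock per
// task state directory, tracked in memory instead of with real file locks.
class StateDirLockSketch {
    // Maps a task directory name to the id of the thread holding its lock.
    private final Map<String, String> owners = new ConcurrentHashMap<>();

    // Non-blocking acquire: true if the caller now holds (or already held) the lock.
    boolean tryLock(String taskDir, String threadId) {
        String current = owners.putIfAbsent(taskDir, threadId);
        return current == null || current.equals(threadId);
    }

    // Called when a thread's tasks are suspended for a rebalance: release
    // every lock the thread owns so the next assignee is not blocked.
    // Returns the number of locks released.
    int suspendAll(String threadId) {
        int released = 0;
        for (Map.Entry<String, String> entry : owners.entrySet()) {
            // remove(key, value) only removes the entry if this thread still owns it.
            if (entry.getValue().equals(threadId)
                    && owners.remove(entry.getKey(), threadId)) {
                released++;
            }
        }
        return released;
    }
}
```

In this model, if thread A holds the lock for a task directory and the task is reassigned to thread B, B's tryLock fails until A calls suspendAll. Without that release A would still own the lock while it rejoins the group, which is the cycle the description identifies.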



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
