flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1376) SubSlots are not properly released in case that a TaskManager fatally fails, leaving the system in a corrupted state
Date Mon, 12 Jan 2015 17:38:34 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273814#comment-14273814 ]

ASF GitHub Bot commented on FLINK-1376:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/300

    [FLINK-1376] Add proper shared slot release in case of a fatal TaskManager failure

    This PR introduces SharedSlot as a special Slot type, so that shared slots are released
properly once an Instance has been marked dead. This fixes the problem that a dead instance
that has not been shut down properly prevents a job from being removed from the system,
because it is not aware of the SubSlots.
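
The release path described above can be pictured with a minimal sketch. It is not the code from this PR: the classes `Slot`, `SubSlot`, `SharedSlot` and `Instance` below are simplified, hypothetical stand-ins that only borrow the names used in the description, and methods such as `register`, `allocateSubSlot` and `markDead` are assumptions made for illustration. The point is that once a shared slot is a slot type of its own, marking the instance dead can release it like any other slot, and the shared slot can pass the release on to its sub-slots.

```java
import java.util.ArrayList;
import java.util.List;

// Entry point first, so the file can be launched directly with `java <file>`.
class SharedSlotReleaseSketch {
    public static void main(String[] args) {
        Instance instance = new Instance();
        SharedSlot shared = instance.register(new SharedSlot());
        SubSlot first = shared.allocateSubSlot();
        SubSlot second = shared.allocateSubSlot();

        // Simulates the fatal TaskManager failure.
        instance.markDead();

        System.out.println("shared released: " + shared.isReleased());
        System.out.println("sub-slots released: "
                + (first.isReleased() && second.isReleased()));
    }
}

// Hypothetical base type; NOT Flink's actual Slot class.
abstract class Slot {
    private boolean released;

    boolean isReleased() { return released; }

    // Idempotent: releasing twice has no further effect.
    void release() { released = true; }
}

// A sub-slot handed out from a shared slot (hypothetical).
class SubSlot extends Slot { }

// A shared slot is itself a Slot and knows the sub-slots it handed out.
class SharedSlot extends Slot {
    private final List<SubSlot> subSlots = new ArrayList<>();

    SubSlot allocateSubSlot() {
        if (isReleased()) {
            throw new IllegalStateException("shared slot already released");
        }
        SubSlot sub = new SubSlot();
        subSlots.add(sub);
        return sub;
    }

    @Override
    void release() {
        // Release the children first so that no sub-slot outlives its parent.
        for (SubSlot sub : subSlots) {
            sub.release();
        }
        subSlots.clear();
        super.release();
    }
}

// Stand-in for a TaskManager instance that owns slots (hypothetical).
class Instance {
    private final List<Slot> slots = new ArrayList<>();
    private boolean dead;

    <T extends Slot> T register(T slot) {
        slots.add(slot);
        return slot;
    }

    boolean isDead() { return dead; }

    // Marking the instance dead releases every slot, shared or not.
    void markDead() {
        dead = true;
        for (Slot slot : slots) {
            slot.release();
        }
        slots.clear();
    }
}
```

In this sketch the instance never needs to know which of its slots are shared; the overridden `release()` takes care of the sub-slots.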
    
    Adds test cases in which only the heartbeat is disabled, to check that the job is failed properly.
    
    @StephanEwen: Would be great if you could take a close look at the code because of the
delicate synchronization mechanism. What I've done in the end is to synchronize most of the
calls by passing them through the SlotSharingGroupAssignment.
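
As a rough illustration of synchronizing calls by routing them through one object, here is a minimal sketch. It does not show the actual SlotSharingGroupAssignment; `SharingCoordinator`, its methods and the integer ids are hypothetical names chosen for this example. The assumption is simply that allocation and release both go through a single coordinator guarded by a single lock, so a release triggered by a failure cannot interleave with a concurrent sub-slot allocation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Entry point first, so the file can be launched directly with `java <file>`.
class CoordinatorDemo {
    // Races allocations against a release; returns the sub-slots still live afterwards.
    static int runDemo() {
        SharingCoordinator coordinator = new SharingCoordinator();
        coordinator.addSharedSlot(1);

        Thread allocator = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                coordinator.allocateSubSlot(1);
            }
        });
        Thread failureHandler = new Thread(() -> coordinator.releaseSharedSlot(1));

        allocator.start();
        failureHandler.start();
        try {
            allocator.join();
            failureHandler.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return coordinator.liveSubSlots();
    }

    public static void main(String[] args) {
        System.out.println("live sub-slots after release: " + runDemo());
    }
}

// Hypothetical coordinator: all state changes pass through it under one lock.
class SharingCoordinator {
    private final Object lock = new Object();
    // shared slot id -> ids of the sub-slots currently allocated from it
    private final Map<Integer, Set<Integer>> subSlotsByShared = new HashMap<>();
    private int nextSubSlotId;

    void addSharedSlot(int sharedId) {
        synchronized (lock) {
            subSlotsByShared.putIfAbsent(sharedId, new HashSet<>());
        }
    }

    // Returns the new sub-slot id, or -1 if the shared slot is unknown or already released.
    int allocateSubSlot(int sharedId) {
        synchronized (lock) {
            Set<Integer> subs = subSlotsByShared.get(sharedId);
            if (subs == null) {
                return -1;
            }
            int id = nextSubSlotId++;
            subs.add(id);
            return id;
        }
    }

    // Returns true if the sub-slot was live and has now been released.
    boolean releaseSubSlot(int sharedId, int subId) {
        synchronized (lock) {
            Set<Integer> subs = subSlotsByShared.get(sharedId);
            return subs != null && subs.remove(subId);
        }
    }

    // Releases the shared slot together with its sub-slots; returns how many sub-slots went with it.
    int releaseSharedSlot(int sharedId) {
        synchronized (lock) {
            Set<Integer> subs = subSlotsByShared.remove(sharedId);
            return subs == null ? 0 : subs.size();
        }
    }

    int liveSubSlots() {
        synchronized (lock) {
            int total = 0;
            for (Set<Integer> subs : subSlotsByShared.values()) {
                total += subs.size();
            }
            return total;
        }
    }
}
```

Whatever the interleaving of the two threads, no sub-slot remains live once both have finished: allocations that ran before the release are dropped together with the shared slot, and allocations that ran after it are rejected.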

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixSharedSlotRelease

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/300.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #300
    
----
commit 02004f98d1d76dc0683392690be38ab721bd6edd
Author: Till Rohrmann <trohrmann@apache.org>
Date:   2015-01-12T09:58:45Z

    [FLINK-1376] [runtime] Add proper shared slot release in case of a fatal TaskManager failure.

----


> SubSlots are not properly released in case that a TaskManager fatally fails, leaving the system in a corrupted state
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-1376
>                 URL: https://issues.apache.org/jira/browse/FLINK-1376
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Till Rohrmann
>
> If a TaskManager fails fatally and some of the failing node's slots are SharedSlots,
the JobManager does not release those slots properly. As a result, the corresponding
job is not failed properly, which leaves the system in a corrupted state.
> The reason is that the AllocatedSlot is not aware that it is being treated as a SharedSlot,
and thus it cannot release the associated SubSlots.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
