flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-9351) RM stop assigning slot to Job because the TM killed before connecting to JM successfully
Date Thu, 07 Jun 2018 06:22:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-9351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504291#comment-16504291 ]

ASF GitHub Bot commented on FLINK-9351:
---------------------------------------

Github user sihuazhou commented on the issue:

    https://github.com/apache/flink/pull/6133
  
    cc @tillrohrmann 


> RM stop assigning slot to Job because the TM killed before connecting to JM successfully
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-9351
>                 URL: https://issues.apache.org/jira/browse/FLINK-9351
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Sihua Zhou
>            Assignee: Sihua Zhou
>            Priority: Critical
>             Fix For: 1.6.0
>
>
> The steps are as follows (copied from Stephan's comments in [5931|https://github.com/apache/flink/pull/5931]):
> - JobMaster / SlotPool requests a slot (AllocationID) from the ResourceManager
> - ResourceManager starts a container with a TaskManager
> - TaskManager registers at ResourceManager, which tells the TaskManager to push a slot to the JobManager.
> - TaskManager container is killed
> - The ResourceManager does not queue back the slot requests (AllocationIDs) that it sent to the previous TaskManager, so the requests are lost and need to time out before another attempt is tried.
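
The sequence above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Flink's code: the class `ToyResourceManager`, its methods, the `requeue_on_tm_loss` flag, and the IDs `alloc-1` and `tm-1` are all hypothetical names invented for illustration. It contrasts the reported behaviour (the request is dropped when the TaskManager dies) with one possible remedy (putting the request back in the pending queue).

```python
from collections import deque


class ToyResourceManager:
    """Toy model of the scenario; hypothetical, not Flink's ResourceManager API."""

    def __init__(self, requeue_on_tm_loss):
        # Assumption: a flag that switches between the reported behaviour
        # (False) and a possible remedy (True).
        self.requeue_on_tm_loss = requeue_on_tm_loss
        self.pending = deque()  # allocation IDs waiting for a TaskManager
        self.assigned = {}      # tm_id -> allocation IDs sent to that TaskManager

    def request_slot(self, allocation_id):
        # JobMaster / SlotPool requests a slot.
        self.pending.append(allocation_id)

    def register_task_manager(self, tm_id):
        # The TaskManager registers and is told to serve the pending requests.
        self.assigned[tm_id] = list(self.pending)
        self.pending.clear()

    def task_manager_lost(self, tm_id):
        # The container is killed before the slot reaches the JobMaster.
        lost = self.assigned.pop(tm_id, [])
        if self.requeue_on_tm_loss:
            # Possible remedy: make the requests pending again.
            self.pending.extend(lost)
        return lost


def run_scenario(requeue_on_tm_loss):
    """Replay the steps from the issue and return what is still pending."""
    rm = ToyResourceManager(requeue_on_tm_loss)
    rm.request_slot("alloc-1")
    rm.register_task_manager("tm-1")
    rm.task_manager_lost("tm-1")
    return list(rm.pending)
```

With `requeue_on_tm_loss=False` nothing is pending after the TaskManager is lost, so in this model the request could only recover through a timeout on the requesting side; with `True` the allocation ID is available again for the next TaskManager that registers.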



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
