flink-issues mailing list archives

From "Zhijiang Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4911) Non-disruptive JobManager Failures via Reconciliation
Date Thu, 27 Oct 2016 08:59:59 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611247#comment-15611247 ]

Zhijiang Wang commented on FLINK-4911:

Yeah, it does "at-least once". 

BTW, I confirmed with Till that the heartbeat manager module is finished and the previously
blocked JIRAs can be resumed. Some of them, related to the heartbeat interaction between TM,
JM and RM, were already assigned to me. I will work on them in the coming days and then look
at the JM failure issue. I think the payload information reported in the heartbeat messages
may be reusable in the JM failure scenario.

> Non-disruptive JobManager Failures via Reconciliation 
> ------------------------------------------------------
>                 Key: FLINK-4911
>                 URL: https://issues.apache.org/jira/browse/FLINK-4911
>             Project: Flink
>          Issue Type: New Feature
>          Components: JobManager, TaskManager
>            Reporter: Stephan Ewen
>            Assignee: Zhijiang Wang
> JobManager failures can be handled in a non-disruptive way - by *reconciling* the new
> JobManager leader and the TaskManagers.
> I suggest using this term (reconcile) - it has also been used by other frameworks (like
> Mesos) for non-disruptive handling of failures.
> The basic approach is the following:
>   - When a JobManager fails, TaskManagers do not cancel tasks, but attempt to reconnect
>     to the JobManager
>   - On connect, the TaskManager tells the JobManager about its currently running tasks
>   - A new JobManager waits for TaskManagers to connect and report a task status. It
>     re-constructs the ExecutionGraph state from these reports
>   - Tasks whose status was not reconstructed in a certain time are assumed failed and
>     trigger regular task recovery.
> To avoid having to re-implement this for the new JobManager / TaskManager approach in
> *flip-6*, I suggest implementing this directly in the {{flip-6}} feature branch.
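
The reconciliation steps in the quoted description can be sketched roughly as follows. This is
only an illustrative model, not Flink code: the class ReconcilingJobManager, its methods, the
TaskState enum and the timeout parameter are all hypothetical names, and time is passed in
explicitly to keep the example deterministic.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    RUNNING = "RUNNING"
    FAILED = "FAILED"


@dataclass
class ReconcilingJobManager:
    """Hypothetical new JobManager leader that rebuilds task state from reports."""
    expected_tasks: set           # task ids the recovered job is known to contain
    timeout_s: float              # reconciliation window (assumed to be configurable)
    started_at: float = 0.0       # time at which this leader took over
    states: dict = field(default_factory=dict)

    def on_task_manager_connect(self, running_tasks, now):
        """A reconnecting TaskManager reports the tasks it is still running."""
        if now - self.started_at > self.timeout_s:
            return  # window closed; late reports are ignored in this sketch
        for task_id in running_tasks:
            if task_id in self.expected_tasks:
                self.states[task_id] = TaskState.RUNNING

    def finish_reconciliation(self, now):
        """Once the window has elapsed, mark unreported tasks as failed.

        Returns the task ids that should go through regular task recovery,
        or an empty list while the leader is still waiting for reports.
        """
        if now - self.started_at < self.timeout_s:
            return []
        missing = sorted(self.expected_tasks - set(self.states))
        for task_id in missing:
            self.states[task_id] = TaskState.FAILED
        return missing
```

For example, with expected tasks {"map-1", "map-2", "sink-1"} and a 10 second window, a
TaskManager that reconnects and reports "map-1" and "sink-1" leaves "map-2" to be treated as
failed once the window elapses.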

This message was sent by Atlassian JIRA
