flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stephan Ewen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-4911) Non-disruptive JobManager Failures via Reconciliation
Date Tue, 25 Oct 2016 14:50:58 GMT

     [ https://issues.apache.org/jira/browse/FLINK-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Stephan Ewen updated FLINK-4911:
    Summary: Non-disruptive JobManager Failures via Reconciliation   (was: Non-disruptive
JobManager Failures)

> Non-disruptive JobManager Failures via Reconciliation 
> ------------------------------------------------------
>                 Key: FLINK-4911
>                 URL: https://issues.apache.org/jira/browse/FLINK-4911
>             Project: Flink
>          Issue Type: New Feature
>          Components: JobManager
>            Reporter: Stephan Ewen
> JobManager failures can be handled in a non-disruptive way - by *reconciling* the new
JobManager leader and the TaskManagers.
> I suggest to use this term (reconcile)  - it has been uses also by other frameworks (like
Mesos) for non-disruptive handling of failures.
> The basic approach is the following:
>   - When a JobManager fails, TaskManagers do not cancel tasks, but attempt to reconnect
to the JobManager
>   - On connect, the TaskManager tells the JobManager about its currently running tasks
>   - A new JobManager waits for TaskManagers to connect and report a task status. It re-constructs
the ExecutionGraph state from these reports
>   - Tasks whose status was not reconstructed in a certain time are assumed failed and
trigger regular task recovery.
> To avoid having to re-implement this for the new JobManager / TaskManager approach in
*flip-6*, I suggest to directly implement this into the {{flip-6}} feature branch.

This message was sent by Atlassian JIRA

View raw message