flink-dev mailing list archives

From "Stephan Ewen (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-4911) Non-disruptive JobManager Failures
Date Tue, 25 Oct 2016 14:05:58 GMT
Stephan Ewen created FLINK-4911:
-----------------------------------

             Summary: Non-disruptive JobManager Failures
                 Key: FLINK-4911
                 URL: https://issues.apache.org/jira/browse/FLINK-4911
             Project: Flink
          Issue Type: New Feature
          Components: JobManager
            Reporter: Stephan Ewen


JobManager failures can be handled in a non-disruptive way.

The basic approach is the following:

  - When a JobManager fails, TaskManagers do not cancel tasks, but attempt to reconnect to
the JobManager
  - On connect, the TaskManager tells the JobManager about its currently running tasks
  - A new JobManager waits for TaskManagers to connect and report a task status. It re-constructs
the ExecutionGraph state from these reports
  - Tasks whose status is not reconstructed within a given timeout are assumed to have failed and trigger
regular task recovery.
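
The steps above can be illustrated with a minimal sketch of the new JobManager's side of the protocol. This is not Flink code: the class and method names (ReconciliationSketch, RecoveringJobManager, onTaskManagerReport, finishReconciliation), the three-state task model, and the explicit millisecond timestamps are all assumptions made for illustration. The real ExecutionGraph tracks far more state than a map of task IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Flink's actual API.
class ReconciliationSketch {

    enum TaskState { UNKNOWN, RUNNING, FAILED }

    /** A freshly started JobManager that rebuilds task state from reports. */
    static class RecoveringJobManager {
        // TreeMap keeps task IDs sorted so results are deterministic.
        private final Map<String, TaskState> tasks = new TreeMap<>();
        private final long deadlineMillis;

        RecoveringJobManager(List<String> expectedTaskIds, long nowMillis, long timeoutMillis) {
            for (String id : expectedTaskIds) {
                tasks.put(id, TaskState.UNKNOWN);
            }
            this.deadlineMillis = nowMillis + timeoutMillis;
        }

        /** Called when a TaskManager reconnects and lists its running tasks. */
        void onTaskManagerReport(List<String> runningTaskIds) {
            for (String id : runningTaskIds) {
                // Only tasks still awaiting a report are updated; a task that
                // was already declared FAILED is not resurrected by a late report.
                tasks.replace(id, TaskState.UNKNOWN, TaskState.RUNNING);
            }
        }

        /**
         * Once the deadline has passed, marks every unreported task as FAILED
         * and returns the IDs that need regular task recovery. Before the
         * deadline it returns an empty list and changes nothing.
         */
        List<String> finishReconciliation(long nowMillis) {
            List<String> toRecover = new ArrayList<>();
            if (nowMillis < deadlineMillis) {
                return toRecover;
            }
            for (Map.Entry<String, TaskState> e : tasks.entrySet()) {
                if (e.getValue() == TaskState.UNKNOWN) {
                    e.setValue(TaskState.FAILED);
                    toRecover.add(e.getKey());
                }
            }
            return toRecover;
        }

        TaskState stateOf(String taskId) {
            return tasks.get(taskId);
        }
    }

    public static void main(String[] args) {
        RecoveringJobManager jm =
                new RecoveringJobManager(Arrays.asList("map-1", "map-2", "sink-1"), 0L, 100L);
        jm.onTaskManagerReport(Arrays.asList("map-1", "sink-1"));
        System.out.println("Tasks to recover: " + jm.finishReconciliation(100L));
    }
}
```

In this sketch, tasks reported before the deadline keep running untouched, and only the unreported ones are handed to recovery. Ignoring late reports for tasks already marked failed is one possible design choice; the issue itself does not specify how such reports should be handled.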

To avoid having to re-implement this for the new JobManager / TaskManager approach in FLIP-6,
I suggest implementing this directly in the {{flip-6}} feature branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
