flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-2289) Make JobManager highly available
Date Mon, 29 Jun 2015 08:58:04 GMT
Till Rohrmann created FLINK-2289:

             Summary: Make JobManager highly available
                 Key: FLINK-2289
                 URL: https://issues.apache.org/jira/browse/FLINK-2289
             Project: Flink
          Issue Type: Improvement
            Reporter: Till Rohrmann
            Assignee: Till Rohrmann

Currently, the {{JobManager}} is the single point of failure in the Flink system. If it fails,
then your job cannot be recovered and the Flink cluster is no longer able to receive new jobs.

Therefore, it is crucial to make the {{JobManager}} fault tolerant so that the Flink cluster
can recover from failed {{JobManager}}. As a first step towards this goal, I propose to make
the {{JobManager}} highly available by starting multiple instances and using Apache ZooKeeper
to elect a leader. The leader is responsible for the execution of the Flink job. 

In case that the {{JobManager}} dies, one of the other running {{JobManager}} will be elected
as the leader and take over the role of the leader. The {{Client}} and the {{TaskManager}}
will automatically detect the new {{JobManager}} by querying the ZooKeeper cluster.

Note that this does not achieve full fault tolerance for the {{JobManager}} but it allows
the cluster to recover from failed {{JobManager}}. The design of high-availability for the
{{JobManager}} is tracked in the wiki here [1].

[1] [https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability]

This message was sent by Atlassian JIRA

View raw message