flink-user mailing list archives

From "Zhijiang(wangzhijiang999)" <wangzhijiang...@aliyun.com>
Subject Re: TaskManager failure detection
Date Thu, 23 Feb 2017 03:22:12 GMT
Hi Dominik,
     As far as I know, the JobManager detects the failure of a TaskManager through the Akka watch mechanism.
It is similar to a heartbeat or ping in the network stack. You can refer to this link: "https://cwiki.apache.org/confluence/display/FLINK/Akka+and+Actors".
 Furthermore, the upstream and downstream tasks connected to the failed TaskManager would also
notice that the network connection has gone inactive, which results in task failure. The failed task state
is then reported to the JobManager, which triggers a restart of the job and restores the state.
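
To make the heartbeat analogy concrete, here is a minimal sketch of timeout-based failure detection. It is not Flink's or Akka's actual code; the class HeartbeatMonitor and its methods are hypothetical names used only for illustration, and the timeout value is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not a Flink or Akka class.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    // Last time (in milliseconds) each TaskManager was heard from.
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record a heartbeat from the given TaskManager.
    void heartbeat(String taskManagerId, long nowMillis) {
        lastSeen.put(taskManagerId, nowMillis);
    }

    // Return and forget every TaskManager whose last heartbeat is older than the timeout.
    List<String> findDead(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                dead.add(entry.getKey());
            }
        }
        dead.forEach(lastSeen::remove);
        return dead;
    }
}
```

In this sketch the monitor plays the role of the JobManager: it would call findDead periodically and treat each returned id as a failed TaskManager whose tasks need to be restarted.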
    As you said, the CheckpointCoordinator component in the JobManager is in charge of recovering
state from a completed checkpoint, and the state is then set onto the Execution in the ExecutionGraph.
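
The restore step can be sketched roughly as follows. This is not Flink's CheckpointCoordinator; the Checkpoint and RestoreSketch classes are hypothetical, state is simplified to a string per task, and picking the completed checkpoint with the highest id is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical names throughout; this is not Flink's CheckpointCoordinator.
final class Checkpoint {
    final long id;
    final boolean completed;
    final Map<String, String> stateByTask;

    Checkpoint(long id, boolean completed, Map<String, String> stateByTask) {
        this.id = id;
        this.completed = completed;
        this.stateByTask = stateByTask;
    }
}

final class RestoreSketch {
    // Pick the completed checkpoint with the highest id; pending ones are ignored.
    static Optional<Checkpoint> latestCompleted(List<Checkpoint> checkpoints) {
        Checkpoint best = null;
        for (Checkpoint c : checkpoints) {
            if (c.completed && (best == null || c.id > best.id)) {
                best = c;
            }
        }
        return Optional.ofNullable(best);
    }

    // Assign each task its state from that checkpoint (null if nothing is available).
    static Map<String, String> restore(List<Checkpoint> checkpoints, List<String> taskIds) {
        Optional<Checkpoint> latest = latestCompleted(checkpoints);
        Map<String, String> assigned = new HashMap<>();
        for (String task : taskIds) {
            assigned.put(task, latest.map(c -> c.stateByTask.get(task)).orElse(null));
        }
        return assigned;
    }
}
```

The returned map stands in for setting state onto each Execution: every task id is paired with the state it should start from after the restart.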
For yarn cluster mode,
------------------------------------------------------------------
From: Dominik Safaric <dominiksafaric@gmail.com>
Sent: Wednesday, February 22, 2017 19:05
To: user <user@flink.apache.org>
Subject: TaskManager failure detection

As I’m investigating Flink’s fault tolerance capabilities, I would like to know which component and class are in charge of TaskManager failure detection and checkpoint restoring. In addition, how does Flink actually determine that a TaskManager has failed, e.g. due to hardware failure?

To my knowledge, the state should be restored using the CheckpointCoordinator or ExecutionGraph. Correct me if I’m wrong.

Thanks in advance,