flink-issues mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (FLINK-3345) Restart TaskManager in case of a Akka quarantine event
Date Fri, 05 Feb 2016 15:07:39 GMT

     [ https://issues.apache.org/jira/browse/FLINK-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann closed FLINK-3345.
    Resolution: Duplicate

Duplicate of issue FLINK-3347

> Restart TaskManager in case of a Akka quarantine event
> ------------------------------------------------------
>                 Key: FLINK-3345
>                 URL: https://issues.apache.org/jira/browse/FLINK-3345
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Runtime
>    Affects Versions: 1.0.0
>            Reporter: Till Rohrmann
> {{ActorSystems}} which get quarantined (death watch trigger, system message failure)
> are not able to reconnect to the quarantining {{ActorSystem}}. In order to reconnect, the
> quarantined {{ActorSystem}} has to be restarted.
> This is a problem for the {{TaskManager}}-{{JobManager}} communication. Whenever a {{TaskManager}}
> gets quarantined, it is effectively useless for the Flink cluster because it cannot reconnect
> to the {{JobManager}}. In such a case, the {{TaskManager}} has to be restarted.
> The following link [1] describes how an {{ActorSystem}} can detect that it got quarantined.
> When the TM detects that it got quarantined, it should shut itself down. In order to restart
> the TM, we could add a retry loop to the `taskmanager.sh` start script which restarts the TM
> whenever it exits with a non-zero return code.
> [1] http://stackoverflow.com/questions/32471088/akka-cluster-detecting-quarantined-state
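
The retry loop proposed for the start script could look roughly like the sketch below. It is not the actual `taskmanager.sh`: the function name `run_with_restart`, the restart limit, and the delay are hypothetical, and it assumes the TaskManager process runs in the foreground and exits with a non-zero code once it detects the quarantine.

```shell
#!/usr/bin/env bash
# Hypothetical retry wrapper, not part of Flink's actual taskmanager.sh.
#
# run_with_restart MAX_RESTARTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND in the foreground and restarts it as long as it exits with a
# non-zero code, up to MAX_RESTARTS restarts. Returns 0 once COMMAND exits
# cleanly, otherwise the exit code of the last failed run.
run_with_restart() {
    local max_restarts=$1
    local delay=$2
    shift 2

    local restarts=0
    local rc=0

    while true; do
        if "$@"; then
            # Clean shutdown (exit code 0): do not restart.
            return 0
        else
            rc=$?
        fi

        if [ "$restarts" -ge "$max_restarts" ]; then
            echo "giving up after $restarts restarts (last exit code $rc)" >&2
            return "$rc"
        fi

        restarts=$((restarts + 1))
        echo "process exited with code $rc, restart $restarts of $max_restarts" >&2
        sleep "$delay"
    done
}
```

A start script could then wrap whatever command launches the TaskManager JVM, for example `run_with_restart 5 10 <taskmanager start command>`. Capping the number of restarts is a design choice of this sketch, so that a TaskManager failing for reasons other than a quarantine does not restart forever.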

This message was sent by Atlassian JIRA