flink-issues mailing list archives

From "Ufuk Celebi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-3411) Failed recovery can lead to removal of HA state
Date Wed, 17 Feb 2016 14:03:18 GMT

     [ https://issues.apache.org/jira/browse/FLINK-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ufuk Celebi updated FLINK-3411:
-------------------------------
    Description: 
When a job is recovered by a standby job manager and the recovery of the checkpoint state
or of the job itself fails, the job might eventually be removed by the job manager once all
retries are exhausted. This also removes the job/checkpoint state from ZooKeeper and the
state backend, making it impossible to ever recover the job again.

We should never exhaust job retries in the HA case.

  was:When a job is recovered by a standby job manager and the recovery of the checkpoint
state or job fails, the job will be removed by the job manager. This leads to the removal
of the job/checkpoint state in ZooKeeper and the state backend, making it impossible to ever
recover the job again.


> Failed recovery can lead to removal of HA state
> -----------------------------------------------
>
>                 Key: FLINK-3411
>                 URL: https://issues.apache.org/jira/browse/FLINK-3411
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Runtime
>            Reporter: Ufuk Celebi
>            Priority: Critical
>
> When a job is recovered by a standby job manager and the recovery of the checkpoint state
> or of the job itself fails, the job might eventually be removed by the job manager once all
> retries are exhausted. This also removes the job/checkpoint state from ZooKeeper and the
> state backend, making it impossible to ever recover the job again.
>
> We should never exhaust job retries in the HA case.
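
The proposed rule ("never exhaust job retries in the HA case") could be sketched roughly as
below. This is only an illustration under assumptions: RecoveryRetryPolicy, Action, and
onRecoveryFailure are hypothetical names, not Flink classes or methods, and the sketch does
not reflect how the job manager actually implements recovery.

```java
// Hypothetical sketch; none of these names are part of Flink's API.
final class RecoveryRetryPolicy {

    /** What to do after a failed recovery attempt. */
    enum Action {
        RETRY,
        // Giving up removes the job, which also deletes its state in
        // ZooKeeper and the state backend (the problem described above).
        FAIL_AND_REMOVE_STATE
    }

    private final boolean highAvailability;
    private final int maxRetries;

    RecoveryRetryPolicy(boolean highAvailability, int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        this.highAvailability = highAvailability;
        this.maxRetries = maxRetries;
    }

    /**
     * @param failedAttempts number of recovery attempts that have failed so far (>= 1)
     */
    Action onRecoveryFailure(int failedAttempts) {
        if (highAvailability) {
            // HA case: never exhaust retries, so the HA state is never removed
            // as a side effect of a failed recovery.
            return Action.RETRY;
        }
        // Non-HA case: allow up to maxRetries retries, then give up.
        return failedAttempts <= maxRetries ? Action.RETRY : Action.FAIL_AND_REMOVE_STATE;
    }
}
```

With this policy, a standby job manager in HA mode keeps retrying however many attempts have
failed, while a non-HA setup still gives up after the configured number of retries.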



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
