flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
Date Wed, 01 Mar 2017 11:24:45 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889998#comment-15889998 ]

ASF GitHub Bot commented on FLINK-4810:

Github user ramkrish86 commented on the issue:

    Thanks for the input. I read the code. As far as I understand it, there are two ways
a checkpoint can fail. If checkpointing cannot be performed for some reason, we send a
DeclineCheckpoint message, which is handled by the CheckpointCoordinator.
    The other is an external error during checkpointing, in which case we call failExternally,
which transitions the state to FAILED, closes all the watchdogs, and also cancels the
invokable. Is the intent to track how many times this happens and, after enough such
failures, fail the execution graph?

> Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
> ------------------------------------------------------------------------------------
>                 Key: FLINK-4810
>                 URL: https://issues.apache.org/jira/browse/FLINK-4810
>             Project: Flink
>          Issue Type: Sub-task
>          Components: State Backends, Checkpointing
>            Reporter: Stephan Ewen
> The Checkpoint coordinator should track the number of consecutive unsuccessful checkpoints.
> If more than {{n}} (configured value) checkpoints fail in a row, it should call {{fail()}}
> on the execution graph to trigger a recovery.
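
The behaviour described in the issue could be sketched roughly as follows. This is only a minimal illustration, not Flink code: the class CheckpointFailureCounter, its method names, and the Runnable standing in for the call to fail() on the execution graph are all hypothetical. It assumes "more than n in a row" means the job is failed on failure number n + 1, and that any successful checkpoint resets the count.

```java
/**
 * Hypothetical helper, not part of Flink: counts consecutive
 * unsuccessful checkpoints and triggers a failure action once
 * more than the configured number fail in a row.
 */
class CheckpointFailureCounter {

    // The configured "n": how many consecutive failures are tolerated.
    private final int tolerableFailures;

    // Stand-in for calling fail() on the execution graph (assumption).
    private final Runnable failAction;

    // Number of unsuccessful checkpoints since the last success.
    private int consecutiveFailures = 0;

    CheckpointFailureCounter(int tolerableFailures, Runnable failAction) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        if (failAction == null) {
            throw new IllegalArgumentException("failAction must not be null");
        }
        this.tolerableFailures = tolerableFailures;
        this.failAction = failAction;
    }

    /** A completed checkpoint breaks the streak of failures. */
    synchronized void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    /** Called for a declined, expired, or otherwise failed checkpoint. */
    synchronized void onCheckpointFailure() {
        consecutiveFailures++;
        if (consecutiveFailures > tolerableFailures) {
            // Reset so that counting starts fresh after recovery.
            consecutiveFailures = 0;
            failAction.run();
        }
    }

    synchronized int getConsecutiveFailures() {
        return consecutiveFailures;
    }
}
```

With n = 2, two failures followed by a success and two more failures would not fail the job; a third failure in a row would. Whether failures reported through both paths mentioned in the comment above should feed the same counter is left open here.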
