flink-issues mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-6328) Savepoints must not be counted as retained checkpoints
Date Mon, 22 May 2017 14:55:04 GMT

    [ https://issues.apache.org/jira/browse/FLINK-6328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019663#comment-16019663 ]

Till Rohrmann commented on FLINK-6328:
--------------------------------------

Given that the lifecycle of a savepoint is outside the control of the {{CheckpointCoordinator}},
I think it is best not to add savepoints to the {{CompletedCheckpointStore}} and, thus, not to
consider them for job recovery. The reason is that FLINK-4815 has not been implemented yet:
without it, a single broken or deleted savepoint would thwart Flink's whole recovery mechanism.
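
A minimal sketch of the proposal, using a hypothetical model ({{Snapshot}}, {{RecoveryStore}}) rather than Flink's actual classes: completed savepoints are simply never registered with the store that recovery reads from, so a missing savepoint cannot be picked as the restore point.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model for illustration only; these are not Flink's classes. */
final class SavepointExclusionSketch {

    /** A completed snapshot; isSavepoint marks user-triggered savepoints. */
    static final class Snapshot {
        final long id;
        final boolean isSavepoint;

        Snapshot(long id, boolean isSavepoint) {
            this.id = id;
            this.isSavepoint = isSavepoint;
        }
    }

    /** Holds only the snapshots that automatic recovery is allowed to use. */
    static final class RecoveryStore {
        private final Deque<Snapshot> checkpoints = new ArrayDeque<>();

        /** Returns true if the snapshot was registered for recovery. */
        boolean onCompleted(Snapshot snapshot) {
            if (snapshot.isSavepoint) {
                // The user owns the savepoint's lifecycle and may delete it
                // at any time, so it is never offered to recovery.
                return false;
            }
            checkpoints.addLast(snapshot);
            return true;
        }

        /** Latest snapshot eligible for recovery, or null if there is none. */
        Snapshot latest() {
            return checkpoints.peekLast();
        }
    }

    public static void main(String[] args) {
        RecoveryStore store = new RecoveryStore();
        store.onCompleted(new Snapshot(1, false)); // regular checkpoint
        store.onCompleted(new Snapshot(2, true));  // savepoint, skipped
        System.out.println("recover from: " + store.latest().id);
    }
}
```

In this sketch recovery falls back to checkpoint 1 even though savepoint 2 is newer, which is the trade-off of the proposal.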

Once FLINK-4815 has been implemented, we might reconsider adding savepoints to the {{CompletedCheckpointStore}}
again and, thus, allowing recovery from savepoints in case of failures. When doing so, we should,
however, not count savepoints toward the number of retained checkpoints, because we cannot
be sure that they still exist.
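
As a rough illustration of that second option, here is a sketch of a store that accepts savepoints but applies the retention limit to regular checkpoints only. All names ({{RetentionSketch}}, {{Store}}, {{Snapshot}}) are hypothetical and the eviction policy is an assumption, not Flink's implementation. In particular, savepoint entries are never evicted here, which a real store would need to handle.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical model for illustration only; these are not Flink's classes. */
final class RetentionSketch {

    static final class Snapshot {
        final long id;
        final boolean isSavepoint;

        Snapshot(long id, boolean isSavepoint) {
            this.id = id;
            this.isSavepoint = isSavepoint;
        }
    }

    static final class Store {
        private final int maxRetainedCheckpoints;
        private final List<Snapshot> entries = new ArrayList<>(); // oldest first

        Store(int maxRetainedCheckpoints) {
            if (maxRetainedCheckpoints < 1) {
                throw new IllegalArgumentException("must retain at least one checkpoint");
            }
            this.maxRetainedCheckpoints = maxRetainedCheckpoints;
        }

        /** Adds a snapshot, then evicts the oldest regular checkpoints over the limit. */
        void add(Snapshot snapshot) {
            entries.add(snapshot);
            int regular = countRegular();
            Iterator<Snapshot> it = entries.iterator();
            while (regular > maxRetainedCheckpoints && it.hasNext()) {
                if (!it.next().isSavepoint) {
                    it.remove(); // savepoints are skipped and never evicted here
                    regular--;
                }
            }
        }

        /** Number of regular checkpoints; savepoints do not count. */
        int countRegular() {
            int n = 0;
            for (Snapshot s : entries) {
                if (!s.isSavepoint) {
                    n++;
                }
            }
            return n;
        }

        /** Ids of all stored snapshots, oldest first. */
        List<Long> ids() {
            List<Long> result = new ArrayList<>();
            for (Snapshot s : entries) {
                result.add(s.id);
            }
            return result;
        }
    }

    public static void main(String[] args) {
        Store store = new Store(1);
        store.add(new Snapshot(1, false)); // checkpoint
        store.add(new Snapshot(2, true));  // savepoint: checkpoint 1 is kept
        store.add(new Snapshot(3, false)); // checkpoint: evicts checkpoint 1
        System.out.println(store.ids());   // prints [2, 3]
    }
}
```

With a limit of 1, the store always holds exactly one regular checkpoint after the first one completes, regardless of how many savepoints are taken in between.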

> Savepoints must not be counted as retained checkpoints
> ------------------------------------------------------
>
>                 Key: FLINK-6328
>                 URL: https://issues.apache.org/jira/browse/FLINK-6328
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0
>            Reporter: Stephan Ewen
>            Assignee: Till Rohrmann
>            Priority: Blocker
>             Fix For: 1.3.0, 1.2.2
>
>
> The Checkpoint Store retains the *n* latest checkpoints.
> Savepoints are counted as well, meaning that for settings with 1 retained checkpoint, there are sometimes no retained checkpoints at all, only a savepoint.
> That is dangerous, because savepoints must be assumed to disappear at any point in time - their lifecycle is out of control of the CheckpointCoordinator.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
