flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-5667) Possible state data loss when task fails while checkpointing
Date Fri, 27 Jan 2017 19:06:24 GMT

    [ https://issues.apache.org/jira/browse/FLINK-5667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843334#comment-15843334 ]

ASF GitHub Bot commented on FLINK-5667:

Github user tillrohrmann commented on a diff in the pull request:

    --- Diff: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java
    @@ -947,11 +951,17 @@ public void run() {
    -				owner.getEnvironment().acknowledgeCheckpoint(checkpointMetaData, subtaskState);
    +				if (asyncCheckpointState.compareAndSet(CheckpointingOperation.AsynCheckpointState.RUNNING, CheckpointingOperation.AsynCheckpointState.COMPLETED)) {
    +					owner.getEnvironment().acknowledgeCheckpoint(checkpointMetaData, subtaskState);
    --- End diff --
    To harden it, I'll reset the state to `RUNNING` in the failure case if it was `COMPLETED`,
    so that cleanup works properly.
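
For readers following the thread, here is a minimal, self-contained sketch of the guard discussed in the diff above. It is not the code from the pull request: the class `GuardedCheckpointSketch`, the `DISCARDED` state, and the `acknowledge` callback are assumptions made for illustration. Only the `RUNNING` to `COMPLETED` `compareAndSet` and the reset to `RUNNING` on failure come from the discussion.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for the asynchronous checkpoint runnable; not Flink's actual class. */
class GuardedCheckpointSketch {

    /** DISCARDED is an assumed third state; the diff only shows RUNNING and COMPLETED. */
    enum AsyncState { RUNNING, DISCARDED, COMPLETED }

    private final AtomicReference<AsyncState> state =
            new AtomicReference<>(AsyncState.RUNNING);
    private volatile boolean acknowledged = false;
    private volatile boolean stateDiscarded = false;

    /**
     * End of the asynchronous part. The acknowledgement is only sent if this
     * call wins the RUNNING -> COMPLETED transition, i.e. close() has not run.
     */
    boolean finishAndAcknowledge(Runnable acknowledge) {
        if (!state.compareAndSet(AsyncState.RUNNING, AsyncState.COMPLETED)) {
            return false; // already closed: state was discarded, so do not acknowledge
        }
        try {
            acknowledge.run(); // stands in for the acknowledgeCheckpoint(...) call
            acknowledged = true;
            return true;
        } catch (RuntimeException e) {
            // Hardening from the comment: go back to RUNNING so close() can clean up.
            state.compareAndSet(AsyncState.COMPLETED, AsyncState.RUNNING);
            close();
            return false;
        }
    }

    /** Invoked when the task fails; discards state only if the checkpoint is still running. */
    boolean close() {
        if (state.compareAndSet(AsyncState.RUNNING, AsyncState.DISCARDED)) {
            stateDiscarded = true; // stands in for discarding the state handles
            return true;
        }
        return false; // already COMPLETED (or already DISCARDED): nothing to do
    }

    AsyncState getState() { return state.get(); }
    boolean isAcknowledged() { return acknowledged; }
    boolean isStateDiscarded() { return stateDiscarded; }

    public static void main(String[] args) {
        GuardedCheckpointSketch op = new GuardedCheckpointSketch();
        op.close(); // the task fails before the async part finishes
        boolean acked = op.finishAndAcknowledge(() -> System.out.println("ack sent"));
        System.out.println("acknowledged=" + acked + ", state=" + op.getState());
    }
}
```

Because both paths contend for the same `RUNNING` state, exactly one of "acknowledge" and "discard" can win, which is the property the issue description asks for.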

> Possible state data loss when task fails while checkpointing
> ------------------------------------------------------------
>                 Key: FLINK-5667
>                 URL: https://issues.apache.org/jira/browse/FLINK-5667
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.2.0, 1.3.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Blocker
>             Fix For: 1.2.0, 1.3.0
> It is possible that Flink loses state data when a {{Task}} fails while a checkpoint is
> being drawn. The scenario is the following:
> Flink has finished the synchronous checkpointing part and starts the asynchronous part
> by creating and submitting an {{AsyncCheckpointRunnable}} to an {{Executor}}. This runnable
> is also registered at the closeable registry. If the {{Task}} now fails before the
> {{AsyncCheckpointRunnable}} has completed, the runnable will be closed because it is
> registered in the closeable registry. The closing operation will discard all state handles
> and cancel all runnable state futures. However, it will not stop the runnable from sending
> an acknowledge message to the {{CheckpointCoordinator}}.
> If this message completes the pending checkpoint, then this checkpoint will be transformed
> into a {{CompletedCheckpoint}} which is faulty (some of its data has already been deleted).
> Depending on Flink's configuration, this will discard older completed checkpoints and thus
> we will have state data loss.

This message was sent by Atlassian JIRA
