flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-7783) Don't always remove checkpoints in ZooKeeperCompletedCheckpointStore#recover()
Date Fri, 20 Oct 2017 09:11:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212398#comment-16212398 ]

ASF GitHub Bot commented on FLINK-7783:

GitHub user aljoscha opened a pull request:


    [FLINK-7783] Don't always remove checkpoints in ZooKeeperCompletedCheckpointStore#recover()

    Alternative version of #4863.
    This one actually works. #4863 does not work because it deserialises checkpoints
on demand, which is problematic because checkpoints were previously registered at the `SharedStateRegistry`.
If we deserialise a checkpoint on demand and call dispose on it (as #4863 does), this can
remove shared state handles that are still needed by other handles.
    This version also fails as soon as one handle cannot be read. If it did not, we
would break other incremental state handles because we would drop their shared state handles.
    R: @StefanRRichter 
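
    The hazard described above can be illustrated with a toy model. This is only a sketch:
`ToySharedState` and its methods are hypothetical names invented for illustration and are
not Flink's actual `SharedStateRegistry` API. It assumes that shared files are reference
counted on registration, and that disposing a checkpoint which was never registered deletes
its files directly because there is no count to consult.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model, NOT Flink's SharedStateRegistry.
class ToySharedState {
    // Files currently present on the (simulated) DFS.
    final Set<String> filesOnDfs = new HashSet<>();
    // How many registered checkpoints reference each shared file.
    final Map<String, Integer> refCounts = new HashMap<>();

    // A checkpoint registers every shared file it references.
    void register(String file) {
        filesOnDfs.add(file);
        refCounts.merge(file, 1, Integer::sum);
    }

    // Discarding a registered checkpoint decrements the count and deletes
    // the file only once no other checkpoint references it.
    void discardRegistered(String file) {
        Integer left = refCounts.computeIfPresent(file, (f, n) -> n > 1 ? n - 1 : null);
        if (left == null) {
            filesOnDfs.remove(file);
        }
    }

    // Assumption for illustration: a checkpoint that was never registered
    // has no reference count, so disposing it deletes the file directly.
    void discardUnregistered(String file) {
        filesOnDfs.remove(file);
    }
}
```

    In this model, two registered checkpoints sharing one file can be discarded one at a
time without losing the file early, whereas disposing an unregistered checkpoint removes a
file that a registered checkpoint may still need.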

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aljoscha/flink jira-7783-zookeeper-state-store-fix-simplified2

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4870

commit 4702bdd96a3baa844850bba47610c2a71ca7f2f1
Author: Aljoscha Krettek <aljoscha.krettek@gmail.com>
Date:   2017-10-19T19:26:20Z

    [FLINK-7783] Don't always remove checkpoints in ZooKeeperCompletedCheckpointStore#recover()


> Don't always remove checkpoints in ZooKeeperCompletedCheckpointStore#recover()
> ------------------------------------------------------------------------------
>                 Key: FLINK-7783
>                 URL: https://issues.apache.org/jira/browse/FLINK-7783
>             Project: Flink
>          Issue Type: Sub-task
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.4.0, 1.3.2
>            Reporter: Aljoscha Krettek
>            Assignee: Aljoscha Krettek
>            Priority: Blocker
>             Fix For: 1.4.0, 1.3.3
> Currently, we always delete checkpoint handles if they (or the data from the DFS) cannot be read: https://github.com/apache/flink/blob/91a4b276171afb760bfff9ccf30593e648e91dfb/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L180
> This can lead to problems in case the DFS is temporarily not available, i.e. we could
> delete all checkpoints even though they are still valid.
> A user reported this problem on the mailing list: https://lists.apache.org/thread.html/9dc9b719cf8449067ad01114fedb75d1beac7b4dff171acdcc24903d@%3Cuser.flink.apache.org%3E
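
The difference between the behaviour described in the issue and a fail-fast recovery can be sketched as follows. This is a simplified illustration, not the code in `ZooKeeperCompletedCheckpointStore`: `RecoverSketch`, `HandleReader`, `recoverAndDrop` and `recoverOrFail` are hypothetical names, and checkpoints are modelled as plain strings.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch; names and types are assumptions, not Flink's API.
class RecoverSketch {
    // Reads the checkpoint behind a handle; throws if the DFS cannot serve it.
    interface HandleReader {
        String read(String handle) throws IOException;
    }

    // Behaviour described in the issue: an unreadable handle is removed
    // from the store, even if the failure is only temporary.
    static List<String> recoverAndDrop(List<String> store, HandleReader reader) {
        List<String> recovered = new ArrayList<>();
        Iterator<String> it = store.iterator();
        while (it.hasNext()) {
            String handle = it.next();
            try {
                recovered.add(reader.read(handle));
            } catch (IOException e) {
                it.remove(); // the checkpoint handle is lost for good
            }
        }
        return recovered;
    }

    // Fail-fast variant: the first unreadable handle aborts recovery and
    // the store is left untouched, so a later attempt can still succeed.
    static List<String> recoverOrFail(List<String> store, HandleReader reader)
            throws IOException {
        List<String> recovered = new ArrayList<>();
        for (String handle : store) {
            recovered.add(reader.read(handle));
        }
        return recovered;
    }
}
```

With a reader that always throws (simulating a temporarily unavailable DFS), `recoverAndDrop` empties the store, while `recoverOrFail` propagates the exception and keeps every handle.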
