Reply-To: dev@flink.apache.org
Date: Tue, 4 Jul 2017 15:17:00 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)" <jira@apache.org>
To: issues@flink.apache.org
Subject: [jira] [Commented] (FLINK-7067) Cancel with savepoint does not
 restart checkpoint scheduler on failure
archived-at: Tue, 04 Jul 2017 15:17:06 -0000


    [ https://issues.apache.org/jira/browse/FLINK-7067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073807#comment-16073807 ] 

ASF GitHub Bot commented on FLINK-7067:
---------------------------------------

GitHub user uce opened a pull request:

    https://github.com/apache/flink/pull/4254

    [FLINK-7067] [jobmanager] Fix side effects after failed cancel-job-with-savepoint

    A failed cancel-job-with-savepoint request has an unintended side effect on jobs that have periodic checkpoints enabled: the periodic checkpoint scheduler is stopped before the savepoint is triggered, but it is not restarted if the savepoint fails and the job is therefore not cancelled.
    
    This fix makes sure that the periodic checkpoint scheduler is restarted iff periodic checkpoints were enabled before.
    
    I have the test in a separate commit, because it uses Reflection to replace a private field with a spied-upon instance of the CheckpointCoordinator in order to test the expected behaviour. This is super fragile and ugly, but the alternatives either require a large refactoring (use factories that can be set during tests) or leave this corner case untested. The separate commit makes it easier to remove/revert the test at a future point in time.
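To illustrate the testing approach described above, here is a minimal, self-contained sketch of swapping a private field for a recording instance via reflection. It is not the test from the pull request: `Graph`, `Coordinator`, `SpyCoordinator`, `onSavepointFailure` and `setPrivateField` are hypothetical names, and the spy is hand-rolled so the example needs no mocking library.

```java
import java.lang.reflect.Field;

class ReflectionSpySketch {

    /** Hypothetical collaborator; stands in for a checkpoint coordinator. */
    static class Coordinator {
        void startScheduler() { }
    }

    /** Hand-rolled spy: delegates to the real method and counts the calls. */
    static class SpyCoordinator extends Coordinator {
        int startCalls;

        @Override
        void startScheduler() {
            startCalls++;
            super.startScheduler();
        }
    }

    /** Hypothetical owner that keeps its coordinator in a private field without a setter. */
    static class Graph {
        private Coordinator coordinator = new Coordinator();

        void onSavepointFailure() {
            coordinator.startScheduler();
        }
    }

    /** Overwrites a private field by name; breaks at runtime if the field is renamed. */
    static void setPrivateField(Object target, String fieldName, Object value) {
        try {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true);
            field.set(target, value);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not set field " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        Graph graph = new Graph();
        SpyCoordinator spy = new SpyCoordinator();
        setPrivateField(graph, "coordinator", spy);

        graph.onSavepointFailure();
        System.out.println("startScheduler calls: " + spy.startCalls); // prints "startScheduler calls: 1"
    }
}
```

The field name is passed as a string, so a rename compiles fine and only fails when the test runs, which is one reason this style of test is fragile.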


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uce/flink 7067-restart_checkpoint_scheduler

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4254.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4254
    
----
commit 7294de0ef77a346b7b38d4b3fcdc421f7fd6855b
Author: Ufuk Celebi <uce@apache.org>
Date:   2017-07-04T14:39:02Z

    [tests] Reduce visibility of helper class methods
    
    There is no need to make the helper methods public. No other class
    should even use this inner test helper invokable.

commit ce924bc146d3cf97e0c5ddcc1ba16610b2fc8d49
Author: Ufuk Celebi <uce@apache.org>
Date:   2017-07-04T14:53:54Z

    [FLINK-7067] [jobmanager] Add test for cancel-job-with-savepoint side effects
    
    I have this test in a separate commit, because it uses Reflection
    to update a private field with a spied-upon instance of the
    CheckpointCoordinator in order to test the expected behaviour. This
    makes it easier to remove/revert at a future point in time.
    
    This is super fragile and ugly, but the alternatives require a
    large refactoring (use factories that can be set during tests)
    or don't test this corner case behaviour.

commit 94aa444cbd7099d7830e06efe3525a717becb740
Author: Ufuk Celebi <uce@apache.org>
Date:   2017-07-04T15:01:32Z

    [FLINK-7067] [jobmanager] Fix side effects after failed cancel-job-with-savepoint
    
    Problem: If a cancel-job-with-savepoint request fails, this has an
    unintended side effect on the respective job if it has periodic
    checkpoints enabled. The periodic checkpoint scheduler is stopped
    before triggering the savepoint, but not restarted if a savepoint
    fails and the job is not cancelled.
    
    This commit makes sure that the periodic checkpoint scheduler is
    restarted iff periodic checkpoints were enabled before.

----


> Cancel with savepoint does not restart checkpoint scheduler on failure
> ----------------------------------------------------------------------
>
>                 Key: FLINK-7067
>                 URL: https://issues.apache.org/jira/browse/FLINK-7067
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.3.1
>            Reporter: Ufuk Celebi
>
> The `CancelWithSavepoint` action of the JobManager first stops the checkpoint scheduler, then triggers a savepoint, and cancels the job after the savepoint completes.
> If the savepoint fails, the command should not have any side effects, so we don't cancel the job. However, the checkpoint scheduler is not restarted in that case.
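The sequence described in the issue, together with the intended fix, can be sketched roughly as follows. This is an illustration under assumptions, not the JobManager code: `Coordinator`, `cancelWithSavepoint` and all method names are hypothetical stand-ins, and the savepoint is modelled as a plain `CompletableFuture`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

class CancelWithSavepointSketch {

    /** Hypothetical stand-in for the checkpoint coordinator (not Flink's class). */
    static class Coordinator {
        private final boolean periodicCheckpointsEnabled;
        private boolean schedulerRunning;

        Coordinator(boolean periodicCheckpointsEnabled) {
            this.periodicCheckpointsEnabled = periodicCheckpointsEnabled;
            this.schedulerRunning = periodicCheckpointsEnabled;
        }

        boolean isPeriodicCheckpointsEnabled() { return periodicCheckpointsEnabled; }
        boolean isSchedulerRunning() { return schedulerRunning; }
        void startScheduler() { schedulerRunning = true; }
        void stopScheduler() { schedulerRunning = false; }
    }

    /**
     * Stops the scheduler, triggers the savepoint and waits for it.
     * Returns true if the job should now be cancelled. If the savepoint fails,
     * the scheduler is restarted only if periodic checkpoints were enabled.
     */
    static boolean cancelWithSavepoint(
            Coordinator coordinator,
            Supplier<CompletableFuture<String>> triggerSavepoint) {
        coordinator.stopScheduler();
        try {
            triggerSavepoint.get().join();
            return true;
        } catch (RuntimeException e) {
            // Savepoint failed: undo the side effect instead of leaving the job
            // without periodic checkpoints.
            if (coordinator.isPeriodicCheckpointsEnabled()) {
                coordinator.startScheduler();
            }
            return false;
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IllegalStateException("savepoint failed"));

        Coordinator periodic = new Coordinator(true);
        boolean cancel = cancelWithSavepoint(periodic, () -> failed);
        System.out.println("cancel=" + cancel
                + ", schedulerRunning=" + periodic.isSchedulerRunning());
        // prints "cancel=false, schedulerRunning=true"
    }
}
```

Without the restart in the `catch` branch, the sketch reproduces the reported behaviour: the job keeps running after a failed savepoint, but no further periodic checkpoints are taken.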

