Date: Tue, 4 Jul 2017 15:17:00 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: issues@flink.apache.org
Subject: [jira] [Commented] (FLINK-7067) Cancel with savepoint does not restart checkpoint scheduler on failure

[ https://issues.apache.org/jira/browse/FLINK-7067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073807#comment-16073807 ]

ASF GitHub Bot commented on FLINK-7067:
---------------------------------------

GitHub user uce opened a pull request:

https://github.com/apache/flink/pull/4254

[FLINK-7067] [jobmanager] Fix side effects after failed cancel-job-with-savepoint

If a cancel-job-with-savepoint request fails, this has an unintended side effect on the respective job if it has periodic checkpoints enabled. The periodic checkpoint scheduler is stopped before triggering the savepoint, but not restarted if a savepoint fails and the job is not cancelled.

This fix makes sure that the periodic checkpoint scheduler is restarted iff periodic checkpoints were enabled before.

I have the test in a separate commit, because it uses Reflection to update a private field with a spied upon instance of the CheckpointCoordinator in order to test the expected behaviour.
This is super fragile and ugly, but the alternatives require a large refactoring (use factories that can be set during tests) or don't test this corner case behaviour. The separate commit makes it easier to remove/revert it at a future point in time.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uce/flink 7067-restart_checkpoint_scheduler

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/4254.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #4254

----
commit 7294de0ef77a346b7b38d4b3fcdc421f7fd6855b
Author: Ufuk Celebi
Date: 2017-07-04T14:39:02Z

[tests] Reduce visibility of helper class methods

There is no need to make the helper methods public. No other class should even use this inner test helper invokable.

commit ce924bc146d3cf97e0c5ddcc1ba16610b2fc8d49
Author: Ufuk Celebi
Date: 2017-07-04T14:53:54Z

[FLINK-7067] [jobmanager] Add test for cancel-job-with-savepoint side effects

I have this test in a separate commit, because it uses Reflection to update a private field with a spied upon instance of the CheckpointCoordinator in order to test the expected behaviour. This makes it easier to remove/revert at a future point in time.

This is super fragile and ugly, but the alternatives require a large refactoring (use factories that can be set during tests) or don't test this corner case behaviour.

commit 94aa444cbd7099d7830e06efe3525a717becb740
Author: Ufuk Celebi
Date: 2017-07-04T15:01:32Z

[FLINK-7067] [jobmanager] Fix side effects after failed cancel-job-with-savepoint

Problem: If a cancel-job-with-savepoint request fails, this has an unintended side effect on the respective job if it has periodic checkpoints enabled. The periodic checkpoint scheduler is stopped before triggering the savepoint, but not restarted if a savepoint fails and the job is not cancelled.
This commit makes sure that the periodic checkpoint scheduler is restarted iff periodic checkpoints were enabled before.

----

> Cancel with savepoint does not restart checkpoint scheduler on failure
> ----------------------------------------------------------------------
>
> Key: FLINK-7067
> URL: https://issues.apache.org/jira/browse/FLINK-7067
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing
> Affects Versions: 1.3.1
> Reporter: Ufuk Celebi
>
> The `CancelWithSavepoint` action of the JobManager first stops the checkpoint scheduler, then triggers a savepoint, and cancels the job after the savepoint completes.
> If the savepoint fails, the command should not have any side effects and we don't cancel the job. The issue is that the checkpoint scheduler is not restarted though.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
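
For readers following the thread, the stop / trigger / restart sequence described in the issue above can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the pull request: `SketchCheckpointCoordinator` and `CancelWithSavepointSketch` are hypothetical stand-ins and not Flink classes, the savepoint is modelled as a plain `CompletableFuture<String>`, and the call blocks on the future for simplicity, whereas a JobManager would react to completion asynchronously.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

/** Hypothetical stand-in for a checkpoint coordinator; NOT Flink's real class. */
final class SketchCheckpointCoordinator {
    private boolean schedulerRunning;

    SketchCheckpointCoordinator(boolean schedulerRunning) {
        // Assumption: "periodic checkpoints enabled" is modelled as "scheduler running".
        this.schedulerRunning = schedulerRunning;
    }

    boolean isSchedulerRunning() {
        return schedulerRunning;
    }

    void startCheckpointScheduler() {
        schedulerRunning = true;
    }

    void stopCheckpointScheduler() {
        schedulerRunning = false;
    }
}

/** Hypothetical sketch of the cancel-with-savepoint flow described in the issue. */
final class CancelWithSavepointSketch {
    private CancelWithSavepointSketch() {}

    /**
     * Stops the scheduler, waits for the savepoint, and cancels the job on success.
     * On savepoint failure the job is left running and the scheduler is restarted,
     * but only if it was running before the request.
     *
     * @return true if the job was cancelled, false if the savepoint failed
     */
    static boolean cancelWithSavepoint(
            SketchCheckpointCoordinator coordinator,
            CompletableFuture<String> savepoint,
            Runnable cancelJob) {

        // Remember the previous state so a failure leaves no side effect behind.
        boolean wasRunning = coordinator.isSchedulerRunning();
        coordinator.stopCheckpointScheduler();

        try {
            // Simplification: block until the savepoint completes or fails.
            savepoint.join();
        } catch (CompletionException | CancellationException e) {
            if (wasRunning) {
                coordinator.startCheckpointScheduler();
            }
            return false;
        }

        cancelJob.run();
        return true;
    }
}
```

Capturing `wasRunning` before stopping the scheduler is what gives the "iff periodic checkpoints were enabled before" behaviour: a job that never had periodic checkpoints does not acquire them as a result of a failed request.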