Date: Wed, 1 Mar 2017 11:24:45 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: issues@flink.apache.org
Reply-To: dev@flink.apache.org
Subject: [jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

    [ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889998#comment-15889998 ]

ASF GitHub Bot commented on FLINK-4810:
---------------------------------------

Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/3334

    Thanks for the input. I read the code. As far as I understand it, a checkpoint can fail in two ways. If checkpointing cannot be performed for some reason, we send a DeclineCheckpoint message, which is handled by the CheckpointCoordinator. If there is an external error during checkpointing, we call failExternally, which transitions the state to FAILED, closes all the watchdogs, and cancels the invokable.
    Is the intent to count the occurrences of such failures and, once there are enough of them, fail the execution graph?

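    To make the question concrete, here is a minimal sketch of the counting behavior described in the issue quoted below. It is not Flink code: the class CheckpointFailureTracker, its methods, and the failJob callback are all hypothetical names chosen for illustration, and the callback merely stands in for calling fail() on the execution graph. The sketch also assumes that declined checkpoints and externally failed checkpoints are counted the same way, which is exactly the point the question leaves open.

```java
/**
 * Hypothetical helper, not part of Flink: counts consecutive unsuccessful
 * checkpoints and runs a callback once more than "n" have failed in a row.
 */
final class CheckpointFailureTracker {
    private final int maxConsecutiveFailures; // the configured "n" from the issue
    private final Runnable failJob;           // stand-in for fail() on the execution graph
    private int consecutiveFailures;

    CheckpointFailureTracker(int maxConsecutiveFailures, Runnable failJob) {
        if (maxConsecutiveFailures < 0) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be >= 0");
        }
        if (failJob == null) {
            throw new IllegalArgumentException("failJob must not be null");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        this.failJob = failJob;
    }

    /** Call for every unsuccessful checkpoint (assumed: declined or failed alike). */
    synchronized void onCheckpointFailed() {
        consecutiveFailures++;
        // "More than n in a row" triggers the failure action.
        if (consecutiveFailures > maxConsecutiveFailures) {
            consecutiveFailures = 0; // start counting afresh after triggering recovery
            failJob.run();
        }
    }

    /** Call for every completed checkpoint; a success breaks the streak. */
    synchronized void onCheckpointCompleted() {
        consecutiveFailures = 0;
    }

    /** Number of unsuccessful checkpoints in the current streak. */
    synchronized int currentFailureCount() {
        return consecutiveFailures;
    }
}
```

    With n = 2, two failures followed by a success and then two more failures would not trigger the callback; only a third failure in a row would. Resetting the counter after the callback fires is a design choice of this sketch, not something the issue specifies.
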
> Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-4810
>                 URL: https://issues.apache.org/jira/browse/FLINK-4810
>             Project: Flink
>          Issue Type: Sub-task
>          Components: State Backends, Checkpointing
>            Reporter: Stephan Ewen
>
> The Checkpoint coordinator should track the number of consecutive unsuccessful checkpoints.
> If more than {{n}} (configured value) checkpoints fail in a row, it should call {{fail()}} on the execution graph to trigger a recovery.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)