Date: Mon, 9 Mar 2015 17:37:38 +0000 (UTC)
From: "Stephan Ewen (JIRA)"
To: dev@flink.incubator.apache.org
Subject: [jira] [Created] (FLINK-1668) Add a config option to specify delays between restarts

Stephan Ewen created FLINK-1668:
-----------------------------------

Summary: Add a config option to specify delays between restarts
Key: FLINK-1668
URL:
https://issues.apache.org/jira/browse/FLINK-1668
Project: Flink
Issue Type: Improvement
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
Fix For: 0.9

The system currently introduces a short delay between a failed task execution and the restarted execution. The reason is that this delay seemed to help surface the problems that led to the failed task.

As an example, if a TaskManager fails, tasks fail due to data transfer errors. The TaskManager is not immediately recognized as failed, though (it takes a while until the heartbeats time out). Immediately re-deploying tasks has a very high chance of assigning work to the TaskManager that is actually not responding, causing the execution retry to fail again. The delay gives the system time to figure out that the TaskManager was lost, so that it is not considered for the retry.

Currently, the system uses the heartbeat timeout as the default delay value. This may make sense as a default for critical task failures, but it is quite high for other types of failures.

In any case, I would like to add an option for users to specify the delay (even set it to 0, if desired).

The delay is not the best solution, in my opinion; we should eventually move to something better. One idea is to put TaskManagers responsible for failed tasks into a "probationary" mode until they have reported back that everything is good (still alive, disk space available, etc.).
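
To make the proposed behavior concrete, here is a minimal sketch of how the restart delay could be resolved: use the user-supplied value if one is set (0 included), otherwise fall back to the heartbeat timeout as today. This is only an illustration, not the actual implementation. The class name, the config key "example.restart.delay-ms", and the plain string map standing in for the configuration are all hypothetical and were made up for this example.

```java
import java.util.Map;

/** Sketch only: resolves how long to wait before re-deploying a failed task. */
final class RestartDelaySketch {

    // Hypothetical key name, invented for this example; not a confirmed Flink option.
    static final String RESTART_DELAY_KEY = "example.restart.delay-ms";

    private RestartDelaySketch() {}

    /**
     * Returns the configured restart delay in milliseconds, or the heartbeat
     * timeout if the user did not set one. A value of 0 disables the delay.
     */
    static long resolveDelayMillis(Map<String, String> config, long heartbeatTimeoutMillis) {
        String raw = config.get(RESTART_DELAY_KEY);
        if (raw == null) {
            // No user setting: keep the current default (the heartbeat timeout).
            return heartbeatTimeoutMillis;
        }
        long delay = Long.parseLong(raw.trim());
        if (delay < 0) {
            throw new IllegalArgumentException("Restart delay must be >= 0, got " + delay);
        }
        return delay;
    }

    public static void main(String[] args) {
        long heartbeatTimeoutMillis = 100_000L; // illustrative value only

        // Without a user setting, the heartbeat timeout is used.
        System.out.println(resolveDelayMillis(Map.of(), heartbeatTimeoutMillis));

        // An explicit 0 means "restart immediately".
        System.out.println(
                resolveDelayMillis(Map.of(RESTART_DELAY_KEY, "0"), heartbeatTimeoutMillis));
    }
}
```

Keeping the heartbeat timeout as the fallback preserves the current behavior for anyone who does not set the option, while jobs whose typical failures are not lost TaskManagers can choose a much shorter delay.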