Mailing-List: contact dev-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@flink.apache.org
Date: Mon, 9 Mar 2015 17:37:38 +0000 (UTC)
From: "Stephan Ewen (JIRA)" <jira@apache.org>
To: dev@flink.incubator.apache.org
Message-ID: <JIRA.12780544.1425922640000.38748.1425922658255@Atlassian.JIRA>
In-Reply-To: <JIRA.12780544.1425922640000@Atlassian.JIRA>
References: <JIRA.12780544.1425922640000@Atlassian.JIRA>
 <JIRA.12780544.1425922640726@arcas>
Subject: [jira] [Created] (FLINK-1668) Add a config option to specify delays
 between restarts
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Stephan Ewen created FLINK-1668:
-----------------------------------

             Summary: Add a config option to specify delays between restarts
                 Key: FLINK-1668
                 URL: https://issues.apache.org/jira/browse/FLINK-1668
             Project: Flink
          Issue Type: Improvement
    Affects Versions: 0.9
            Reporter: Stephan Ewen
            Assignee: Stephan Ewen
             Fix For: 0.9


The system currently introduces a short delay between a failed task execution and the restarted execution.

The reason is that this delay seemed to help surface the problems that led to the failed task. For example, if a TaskManager fails, tasks fail due to data transfer errors. The TaskManager is not immediately recognized as failed, though (it takes a while until heartbeats time out). Immediately re-deploying tasks has a very high chance of assigning work to the TaskManager that is actually not responding, causing the execution retry to fail again. The delay gives the system time to figure out that the TaskManager was lost, so that it is no longer taken into account upon the retry.
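
To make the timing argument concrete, here is a minimal sketch of a scheduler view that judges liveness only by heartbeat age. It is a toy model, not Flink code: the class HeartbeatView, its methods, and the TaskManager ids are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy model (not Flink code): liveness is judged only by heartbeat age. */
class HeartbeatView {
    private final long heartbeatTimeoutMillis;
    // Last heartbeat timestamp per TaskManager id, in registration order.
    private final Map<String, Long> lastHeartbeatMillis = new LinkedHashMap<>();

    HeartbeatView(long heartbeatTimeoutMillis) {
        this.heartbeatTimeoutMillis = heartbeatTimeoutMillis;
    }

    void recordHeartbeat(String taskManagerId, long nowMillis) {
        lastHeartbeatMillis.put(taskManagerId, nowMillis);
    }

    /** TaskManagers that would still be considered usable at the given time. */
    List<String> candidatesAt(long nowMillis) {
        List<String> alive = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeatMillis.entrySet()) {
            // A TaskManager only drops out once its last heartbeat is older than the timeout.
            if (nowMillis - entry.getValue() <= heartbeatTimeoutMillis) {
                alive.add(entry.getKey());
            }
        }
        return alive;
    }
}
```

In this model, a TaskManager that crashed right after its last heartbeat is still returned by candidatesAt(...) for any retry attempted before the timeout elapses, which is the situation the delay is meant to avoid.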

Currently, the system uses the heartbeat timeout as the default delay value. This may make sense as a default value for critical task failures, but is actually quite high for other types of failures.

In any case, I would like to add an option for users to specify the delay (even set it to 0, if desired).
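
A rough sketch of how such an option could be resolved, assuming a plain key/value configuration. The key name "restart.delay.millis" and the class RestartDelayConfig are hypothetical and only serve the illustration; the actual option name is not decided in this issue.

```java
import java.util.Properties;

/** Hypothetical sketch: resolve the restart delay from a key/value configuration. */
class RestartDelayConfig {
    // Assumed key name, chosen for illustration only.
    static final String RESTART_DELAY_KEY = "restart.delay.millis";

    /**
     * Returns the user-specified delay if present, otherwise falls back to the
     * heartbeat timeout (the current default). A value of 0 disables the delay.
     */
    static long resolveDelayMillis(Properties config, long heartbeatTimeoutMillis) {
        String raw = config.getProperty(RESTART_DELAY_KEY);
        if (raw == null) {
            return heartbeatTimeoutMillis;
        }
        long delay = Long.parseLong(raw.trim());
        if (delay < 0) {
            throw new IllegalArgumentException(
                    RESTART_DELAY_KEY + " must be >= 0, but was " + delay);
        }
        return delay;
    }
}
```

With an empty configuration this returns the heartbeat timeout, so existing behavior is unchanged unless the user sets the key explicitly.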

In my opinion, the delay is not the best solution, and we should eventually move to something better. One idea is to put TaskManagers responsible for failed tasks into a "probationary" mode until they have reported back that everything is good (still alive, disk space available, etc.).
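
One possible shape of the "probationary" idea, as a minimal sketch only. The class ProbationTracker and its method names are hypothetical; nothing like this exists in the code base yet, and the real health report would carry more than a single signal.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a probation list for TaskManagers involved in task failures. */
class ProbationTracker {
    private final Set<String> onProbation = new HashSet<>();

    /** Called when a task running on the given TaskManager has failed. */
    void taskFailedOn(String taskManagerId) {
        onProbation.add(taskManagerId);
    }

    /** Called when the TaskManager reports back that it is healthy (alive, disk space ok, ...). */
    void reportedHealthy(String taskManagerId) {
        onProbation.remove(taskManagerId);
    }

    /** The scheduler would skip TaskManagers that are still on probation. */
    boolean isSchedulable(String taskManagerId) {
        return !onProbation.contains(taskManagerId);
    }
}
```

Unlike a fixed delay, a retry in this scheme could start immediately on the remaining TaskManagers, while the suspect one stays excluded until it has reported back.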

