flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stephan Ewen (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-1668) Add a config option to specify delays between restarts
Date Mon, 09 Mar 2015 17:37:38 GMT
Stephan Ewen created FLINK-1668:

             Summary: Add a config option to specify delays between restarts
                 Key: FLINK-1668
                 URL: https://issues.apache.org/jira/browse/FLINK-1668
             Project: Flink
          Issue Type: Improvement
    Affects Versions: 0.9
            Reporter: Stephan Ewen
            Assignee: Stephan Ewen
             Fix For: 0.9

The system currently introduces a short delay between a failed task execution and the restarted

The reason is that this delay seemed to help in letting problems surface that let to the failed
task. As an example, if a TaskManager fails, tasks fail due to data transfer errors. The TaskManager
is not immediately recognized as failed, though (takes a bit until heartbeats time out). Immediately
re-deploying tasks has a very high chance of assigning work to the TaskManager that is actually
not responding, causing the execution retry to fail again. The delay gives the system time
to figure out that the TaskManager was lost and does not take it into account upon the retry.

Currently, the system uses the heartbeat timeout as the default delay value. This may make
sense as a default value for critical task failures, but is actually quite high for other
types of failures.

In any case, I would like to add an option for users to specify the delay (even set it to
0, if desired).

The delay is not the best solution, in my opinion, we should eventually move to something
better. Ideas are to put TaskManagers responsible for failed tasks in a "probationary" mode
until they have reported back that everything is good (still alive, disk space available,

This message was sent by Atlassian JIRA

View raw message