flink-issues mailing list archives

From shuai-xu <...@git.apache.org>
Subject [GitHub] flink pull request #3539: [FLINK-4256] Flip1: fine gained recovery
Date Wed, 15 Mar 2017 06:30:29 GMT
GitHub user shuai-xu opened a pull request:


    [FLINK-4256] Flip1: fine gained recovery

    This is an informal PR for the implementation of FLIP-1, version 1.
    When a task fails, it restarts only the minimal set of pipelined-connected executions
instead of the whole execution graph.
    Main changes:
    1. ExecutionGraph no longer manages failover. It only records the number of finished
JobVertex instances and switches to FINISHED when all vertices finish (FailoverCoordinator may
take this over later). Its state can now only be CREATED, RUNNING, FAILED, FINISHED or SUSPENDED.
    2. FailoverCoordinator now manages failover. It generates several FailoverRegions
when the EG is attached and listens for execution failures. When an execution fails, it
finds a FailoverRegion to carry out the failover.
    3. When the JM needs the EG to be canceled or failed, the EG notifies FailoverCoordinator.
FailoverCoordinator notifies all FailoverRegions to cancel their executions and, once all
executions are canceled, notifies the EG to switch to CANCELED or FAILED.
    4. FailoverCoordinator has several states: RUNNING, FAILING, CANCELLING, FAILED and CANCELED.

    5. A FailoverRegion contains a minimal set of pipelined-connected executions and manages
their failover.
    7. One FailoverRegion may precede or succeed others. When a preceding region
fails over, all of its succeeding regions must fail over too. A succeeding region just resets
its executions and waits for the preceding region to start it once the preceding region
finishes. The preceding region waits for its succeeding regions to reach CREATED and then
schedules again.
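
To make the "minimal pipelined connected executions" idea concrete, here is a rough sketch of how regions could be derived. It is not the code in this PR. The class FailoverRegionSketch, its Edge type, and the methods buildRegions and verticesToRestart are hypothetical names, and vertices are plain integers rather than Flink executions. It assumes every edge is marked as either pipelined or blocking, and it groups vertices joined by pipelined edges with a union-find.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not the PR's code: groups vertices into regions by pipelined connectivity. */
public class FailoverRegionSketch {

    /** Hypothetical edge type; pipelined is false for a blocking data exchange. */
    public static final class Edge {
        final int source;
        final int target;
        final boolean pipelined;

        public Edge(int source, int target, boolean pipelined) {
            this.source = source;
            this.target = target;
            this.pipelined = pipelined;
        }
    }

    /** Union-find lookup with path halving. */
    private static int find(int[] parent, int v) {
        while (parent[v] != v) {
            parent[v] = parent[parent[v]];
            v = parent[v];
        }
        return v;
    }

    /**
     * Returns an array mapping each vertex to a region id. Two vertices get the
     * same id exactly when a chain of pipelined edges connects them.
     */
    public static int[] buildRegions(int vertexCount, List<Edge> edges) {
        int[] parent = new int[vertexCount];
        for (int v = 0; v < vertexCount; v++) {
            parent[v] = v;
        }
        for (Edge e : edges) {
            if (e.pipelined) {
                // Only pipelined edges merge regions; blocking edges separate them.
                parent[find(parent, e.source)] = find(parent, e.target);
            }
        }
        int[] regionOf = new int[vertexCount];
        for (int v = 0; v < vertexCount; v++) {
            regionOf[v] = find(parent, v);
        }
        return regionOf;
    }

    /** Lists the vertices that share a region with the failed vertex, in ascending order. */
    public static List<Integer> verticesToRestart(int[] regionOf, int failed) {
        List<Integer> result = new ArrayList<>();
        for (int v = 0; v < regionOf.length; v++) {
            if (regionOf[v] == regionOf[failed]) {
                result.add(v);
            }
        }
        return result;
    }
}
```

With edges 0->1 (pipelined), 1->2 (blocking) and 2->3 (pipelined), a failure of vertex 0 selects vertices 0 and 1 only. The sketch stops at region membership; the ordering between preceding and succeeding regions described in the last item is not modelled.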

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shuai-xu/flink jira-4256

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3539

commit 363f1536838064edbdd5f39e41f3f19f6c511fc4
Author: shuai.xus <shuai.xus@alibaba-inc.com>
Date:   2017-03-15T03:36:11Z

    [FLINK-4256] Flip1: fine gained recovery


If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
