flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1953) Rework Checkpoint Coordinator
Date Mon, 04 May 2015 23:17:06 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527536#comment-14527536 ]

ASF GitHub Bot commented on FLINK-1953:
---------------------------------------

GitHub user StephanEwen opened a pull request:

    https://github.com/apache/flink/pull/651

    [FLINK-1953] [runtime] Integrate new snapshot checkpoint coordinator with jobgraph and execution graph

    The core commit is https://github.com/apache/flink/commit/abd5ac7d78c5231e95bbbaaf15dad8f8c83221f9

    This builds on top of the reworked Task from #648.
    It also adds a bunch of unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StephanEwen/incubator-flink checkpointing

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/651.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #651
    
----
commit 60d1e141d625f4e431e9cda7a2fc246a25d8816a
Author: Stephan Ewen <sewen@apache.org>
Date:   2015-04-30T20:05:27Z

    [streaming] New Source and state checkpointing interfaces that allow operations to interact with the state checkpointing in a more precise manner.

commit 04969b380cdafaa1d63dbb2740b6092d237dbaab
Author: Stephan Ewen <sewen@apache.org>
Date:   2015-05-02T23:15:39Z

    [FLINK-1968] [runtime] Clean up and improve the distributed cache.
    
     - Gives a proper exception when a non-cached file is accessed
     - Forwards I/O exceptions that happen during file transfer, rather than only returning null when transfer failed
     - Consistently keeps reference counts and copies only when needed
     - Properly removes all files when shutdown
     - Uses a shutdown hook to remove files when process is killed

commit bba8504c125c1f81c448cb2d4a6fbad7e79f4e7e
Author: Stephan Ewen <sewen@apache.org>
Date:   2015-05-02T23:57:37Z

    [runtime] Fix TaskExecutionState against non-serializable exceptions.

commit 9b9594a7569ed01dcfc97a82880938c033171bec
Author: Stephan Ewen <sewen@apache.org>
Date:   2015-05-03T02:41:03Z

    [FLINK-1672] [runtime] Unify Task and RuntimeEnvironment into one class.
    
     - This simplifies and hardens the failure handling during task startup
     - Guarantees that no actor system threads are blocked by task bootstrap, or task canceling
     - Corrects some previously erroneous corner case state transitions
     - Adds simple and robust tests

commit 3e4ed4e9e6492fa2d06892dc42c491125f32ad98
Author: Stephan Ewen <sewen@apache.org>
Date:   2015-05-03T12:10:35Z

    [FLINK-1969] [runtime] Remove deprecated profiler code

commit 5da3a5d5b19414ef794e8e6f8e6a3c77c613ffce
Author: Stephan Ewen <sewen@apache.org>
Date:   2015-05-03T12:10:58Z

    Update build target path in README.md

commit abd5ac7d78c5231e95bbbaaf15dad8f8c83221f9
Author: Stephan Ewen <sewen@apache.org>
Date:   2015-04-30T17:59:36Z

    [FLINK-1953] [runtime] Integrate new snapshot checkpoint coordinator with jobgraph and execution graph

commit ef3fd5de4fa414d41e451892219a6716ada3c036
Author: Stephan Ewen <sewen@apache.org>
Date:   2015-05-04T22:26:05Z

    [FLINK-1973] [jobmanager] Task execution state messages are logged on INFO level, rather than on DEBUG level

----


> Rework Checkpoint Coordinator
> -----------------------------
>
>                 Key: FLINK-1953
>                 URL: https://issues.apache.org/jira/browse/FLINK-1953
>             Project: Flink
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 0.9
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>             Fix For: 0.9
>
>
> The checkpoint coordinator currently contains no tests and is vulnerable to a variety of situations. In particular, I propose to add:
>  - Better configurability of which tasks receive the trigger checkpoint messages, which tasks need to acknowledge the checkpoint, and which tasks need to receive confirmation messages.
>  - Checkpoint timeouts, such that incomplete checkpoints are guaranteed to be cleaned up after a while, regardless of successful checkpoints.
>  - Better sanity checking of messages and fields, to properly handle/ignore messages for old/expired checkpoints, or invalidly routed messages.
>  - Better handling of checkpoint attempts at points where the execution has just failed or is currently being canceled.
>  - A good set of tests.
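
The timeout and sanity-checking items in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the pull request: the class PendingCheckpointTracker, its methods, and the use of plain String task IDs are all hypothetical names chosen for this example, and time is passed in explicitly so that the behavior is easy to test.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; this is NOT Flink's actual checkpoint coordinator. */
final class PendingCheckpointTracker {

    /** One checkpoint that is still waiting for acknowledgements. */
    private static final class Pending {
        final long triggeredAtMillis;
        final Set<String> awaitingAcks;

        Pending(long triggeredAtMillis, Set<String> tasksToAck) {
            this.triggeredAtMillis = triggeredAtMillis;
            this.awaitingAcks = new HashSet<String>(tasksToAck);
        }
    }

    private final long timeoutMillis;
    private final Map<Long, Pending> pending = new HashMap<Long, Pending>();

    PendingCheckpointTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Registers a new checkpoint and the tasks that must acknowledge it. */
    void trigger(long checkpointId, long nowMillis, Set<String> tasksToAck) {
        pending.put(checkpointId, new Pending(nowMillis, tasksToAck));
    }

    /**
     * Handles one acknowledgement. Returns true only if this message completed
     * the checkpoint. Messages for unknown or expired checkpoints, duplicate
     * acks, and acks from tasks that were never expected are ignored.
     */
    boolean acknowledge(long checkpointId, String taskId, long nowMillis) {
        expire(nowMillis);
        Pending p = pending.get(checkpointId);
        if (p == null || !p.awaitingAcks.remove(taskId)) {
            return false;
        }
        if (p.awaitingAcks.isEmpty()) {
            pending.remove(checkpointId);
            return true;
        }
        return false;
    }

    /** Drops every pending checkpoint older than the timeout; returns how many were dropped. */
    int expire(long nowMillis) {
        int removed = 0;
        Iterator<Pending> it = pending.values().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().triggeredAtMillis >= timeoutMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

In this sketch, expiry runs on every acknowledgement, so an incomplete checkpoint is cleaned up whether or not any other checkpoint succeeds. A real coordinator would more likely use a timer for this.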



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
