flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-5193) Recovering all jobs fails completely if a single recovery fails
Date Wed, 30 Nov 2016 13:11:59 GMT

    [ https://issues.apache.org/jira/browse/FLINK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15708551#comment-15708551 ]

ASF GitHub Bot commented on FLINK-5193:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/2909

    [FLINK-5193] [jm] Harden job recovery in case of recovery failures

    When recovering multiple jobs, a single recovery failure caused none of the jobs to be recovered.
    This PR changes this behaviour to make the recovery of jobs independent so that a single
    failure won't make the complete recovery fail. Furthermore, this PR improves the error reporting
    for failures originating in the ZooKeeperSubmittedJobGraphStore.
    
    Add test case
    
    Fix failing JobManagerHACheckpointRecoveryITCase

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixJobRecoveryFailure

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2909.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2909
    
----
commit d61636d0465e0e0f274871a883d8d376c223a1f3
Author: Till Rohrmann <trohrmann@apache.org>
Date:   2016-11-29T16:31:08Z

    [FLINK-5193] [jm] Harden job recovery in case of recovery failures
    
    When recovering multiple jobs, a single recovery failure caused none of the jobs to be recovered.
    This PR changes this behaviour to make the recovery of jobs independent so that a single
    failure won't stall the complete recovery. Furthermore, this PR improves the error reporting
    for failures originating in the ZooKeeperSubmittedJobGraphStore.
    
    Add test case
    
    Fix failing JobManagerHACheckpointRecoveryITCase

----


> Recovering all jobs fails completely if a single recovery fails
> ---------------------------------------------------------------
>
>                 Key: FLINK-5193
>                 URL: https://issues.apache.org/jira/browse/FLINK-5193
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager
>    Affects Versions: 1.2.0, 1.1.3
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>             Fix For: 1.2.0, 1.1.4
>
>
> In the HA case, where the {{JobManager}} tries to recover all submitted job graphs (e.g. when
> regaining leadership), it can happen that none of the submitted jobs are recovered if a single
> recovery fails. Instead of failing the complete recovery procedure, the {{JobManager}} should
> still try to recover the remaining (non-failing) jobs and print a proper error message for
> the failed recoveries.
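
The behaviour requested above can be illustrated with a minimal sketch. This is not the code from the pull request and does not use Flink's actual classes: `JobGraphStore`, `RecoveryResult`, `recoverAll` and `demoStore` are hypothetical names invented for this example, and job graphs are represented as plain strings. The only point it shows is that each recovery is wrapped in its own try/catch, so one failing job is reported and skipped while the others are still recovered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndependentRecoverySketch {

    /** Hypothetical stand-in for a store of submitted job graphs; not Flink's API. */
    interface JobGraphStore {
        List<String> jobIds();
        String recover(String jobId) throws Exception;
    }

    /** Outcome of one recovery pass: recovered job graphs plus per-job failures. */
    static final class RecoveryResult {
        final List<String> recovered = new ArrayList<>();
        final Map<String, Exception> failures = new LinkedHashMap<>();
    }

    /** Recovers every job independently; a failure affects only the job that caused it. */
    static RecoveryResult recoverAll(JobGraphStore store) {
        RecoveryResult result = new RecoveryResult();
        for (String jobId : store.jobIds()) {
            try {
                result.recovered.add(store.recover(jobId));
            } catch (Exception e) {
                // Record and report the failure, then continue with the remaining jobs.
                result.failures.put(jobId, e);
                System.err.println("Could not recover job " + jobId + ": " + e.getMessage());
            }
        }
        return result;
    }

    /** Hypothetical demo store in which exactly one job ("job-b") cannot be recovered. */
    static JobGraphStore demoStore() {
        return new JobGraphStore() {
            @Override
            public List<String> jobIds() {
                return Arrays.asList("job-a", "job-b", "job-c");
            }

            @Override
            public String recover(String jobId) throws Exception {
                if (jobId.equals("job-b")) {
                    throw new Exception("simulated recovery failure");
                }
                return "graph(" + jobId + ")";
            }
        };
    }

    public static void main(String[] args) {
        RecoveryResult r = recoverAll(demoStore());
        System.out.println("recovered=" + r.recovered + " failed=" + r.failures.keySet());
    }
}
```

Collecting the failures in a map, rather than only logging them, is a design choice of this sketch: it lets the caller decide afterwards whether to retry or just report the failed jobs.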



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
