flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-2472) Make the JobClientActor check periodically if the submitted Job is still running and if the JobManager is still alive
Date Wed, 05 Aug 2015 11:46:04 GMT

    [ https://issues.apache.org/jira/browse/FLINK-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655237#comment-14655237 ]

ASF GitHub Bot commented on FLINK-2472:
---------------------------------------

Github user sachingoel0101 commented on the pull request:

    https://github.com/apache/flink/pull/979#issuecomment-127968729
  
    I've added a few more message handlers, so that:
    1. We never miss a `RUNNING` state between restarts.
    2. There is a timeout for repeatedly receiving `CANCELLING`/`CANCELED` or `FAILING`/`FAILED` messages.
    
    Further, I worked around the `receiveTimeout` bug(?) where a timeout message might be enqueued even though we have just received a message. The workaround keeps track of the last ping from the `JobManager` and applies a tolerance of 0.1 times the `JOB_MANAGER_TIMEOUT`.
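
To illustrate the tolerance check, here is a minimal Java sketch. It is not the code from the pull request: the class `ReceiveTimeoutGuard` and its methods are hypothetical, and it assumes the tolerance is applied by accepting a timeout message only when the time since the last `JobManager` ping is at least the timeout minus 0.1 times the timeout.

```java
/**
 * Hypothetical helper (not part of Flink): decides whether a receive-timeout
 * message is genuine or was enqueued although a message arrived recently.
 */
final class ReceiveTimeoutGuard {
    private final long timeoutMillis;
    private final long toleranceMillis;
    private long lastPingMillis;

    ReceiveTimeoutGuard(long jobManagerTimeoutMillis, long nowMillis) {
        this.timeoutMillis = jobManagerTimeoutMillis;
        // Assumption: the tolerance is 0.1 times the JobManager timeout.
        this.toleranceMillis = jobManagerTimeoutMillis / 10;
        this.lastPingMillis = nowMillis;
    }

    /** Call this whenever any message from the JobManager arrives. */
    void recordPing(long nowMillis) {
        lastPingMillis = nowMillis;
    }

    /** Returns true only if the silence lasted (almost) the full timeout. */
    boolean isGenuineTimeout(long nowMillis) {
        long silence = nowMillis - lastPingMillis;
        return silence >= timeoutMillis - toleranceMillis;
    }

    public static void main(String[] args) {
        ReceiveTimeoutGuard guard = new ReceiveTimeoutGuard(1000L, 0L);
        guard.recordPing(500L);
        // Only 500 ms of silence: a timeout message here is spurious.
        System.out.println(guard.isGenuineTimeout(1000L)); // false
        // 900 ms of silence reaches timeout minus tolerance.
        System.out.println(guard.isGenuineTimeout(1400L)); // true
    }
}
```

Passing the current time in explicitly keeps the check deterministic; an actor would presumably call `recordPing` from every message handler and consult `isGenuineTimeout` when the timeout message arrives.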
    
    @tillrohrmann, could you look this over again? Let me know if there are still unhandled cases.


> Make the JobClientActor check periodically if the submitted Job is still running and if the JobManager is still alive
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-2472
>                 URL: https://issues.apache.org/jira/browse/FLINK-2472
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Till Rohrmann
>            Assignee: Sachin Goel
>
> If the {{JobManager}} dies without notifying possibly connected {{JobClientActors}}, or if the job execution finishes without sending the {{SerializedJobExecutionResult}} back to the {{JobClientActor}}, a call to {{JobClient.submitJobAndWait}} might never return.
> I propose to let the {{JobClientActor}} periodically check whether the {{JobManager}} is still alive and whether the submitted job is still running. If not, the {{JobClientActor}} should return an exception to complete the waiting future.
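
The proposal above can be sketched in plain Java. This is only an illustration under stated assumptions: it uses `java.util.concurrent` instead of the Akka actors that Flink's `JobClientActor` is built on, and the class `JobWatchdogSketch`, its methods, and the two liveness probes are hypothetical names, not Flink APIs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical watchdog (not Flink code) that fails a waiting future. */
final class JobWatchdogSketch {

    /** One check: fail the future if the JobManager or the job is gone. */
    static void checkOnce(CompletableFuture<?> result,
                          boolean jobManagerAlive, boolean jobRunning) {
        if (result.isDone()) {
            return; // a result or an earlier failure already arrived
        }
        if (!jobManagerAlive) {
            result.completeExceptionally(
                new IllegalStateException("JobManager is no longer reachable"));
        } else if (!jobRunning) {
            result.completeExceptionally(
                new IllegalStateException("Job is not running and no result was received"));
        }
    }

    /** Runs checkOnce periodically until the future completes. */
    static ScheduledFuture<?> watch(ScheduledExecutorService scheduler,
                                    CompletableFuture<?> result,
                                    BooleanSupplier jobManagerAlive,
                                    BooleanSupplier jobRunning,
                                    long periodMillis) {
        ScheduledFuture<?> task = scheduler.scheduleAtFixedRate(
            () -> checkOnce(result, jobManagerAlive.getAsBoolean(), jobRunning.getAsBoolean()),
            periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        // Stop polling as soon as the future completes, normally or not.
        result.whenComplete((value, error) -> task.cancel(false));
        return task;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CompletableFuture<String> result = new CompletableFuture<>();
        // Simulate a JobManager that has died: the probe always reports false.
        watch(scheduler, result, () -> false, () -> true, 100L);
        try {
            result.get(5, TimeUnit.SECONDS);
        } catch (ExecutionException e) {
            System.out.println("Job failed: " + e.getCause().getMessage());
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```

In this sketch the caller blocked on `result.get(...)` is released with an exception instead of waiting forever, which is the behavior the issue asks for from `JobClient.submitJobAndWait`.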



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
