flink-issues mailing list archives

From "Zhijiang Wang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (FLINK-5501) Determine whether the job starts from last JobManager failure
Date Wed, 18 Jan 2017 03:16:26 GMT

    [ https://issues.apache.org/jira/browse/FLINK-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827361#comment-15827361
] 

Zhijiang Wang edited comment on FLINK-5501 at 1/18/17 3:16 AM:
---------------------------------------------------------------

Thank you for the quick response!

Yes, you have already considered all the feasible alternatives for implementing this goal, and
I totally agree with them.

1. For extending the leader election service, I also thought of this approach before implementing.
In the current {{ZookeeperLeaderElectionService}}, the leader node is of EPHEMERAL type. If the
incrementing number is carried in this node, the node would have to be changed to PERSISTENT type;
otherwise another node would have to be added for the incrementing number. This approach is very
similar to the one based on {{RunningJobsRegistry}}. From a semantic point of view,
{{LeaderElectionService}} may be more suitable, but to minimize changes I have already implemented
it with {{RunningJobsRegistry}}.

2. Actually, I had not thought of this approach before; it is a totally different and interesting idea.
The {{TaskManager}} is aware of {{JobManager}} leader changes and will re-register with the new
leader after a change, so the {{JobManager}} can rely on the registration process to determine
the status.
But it may be complicated to coordinate normal scheduling and reconciling, because they
would be triggered at the same time. It would also temporarily waste more resources.
If the {{JobManager}} can determine the status in an easy way after startup, it can run the specific
process directly, with no need to handle an ambiguous situation.
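
To make the first alternative concrete, here is a minimal sketch of the incrementing-number idea. It assumes the number is a per-job counter that is bumped on every leadership grant and kept somewhere that outlives the leader's session (the PERSISTENT node mentioned above). A plain in-memory map stands in for ZooKeeper, and the class and method names are hypothetical, not part of {{LeaderElectionService}}.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a leader epoch counter; not Flink's actual leader election API. */
class LeaderEpochSketch {

    // Stands in for a PERSISTENT ZooKeeper node: the value must survive the
    // previous leader's session, which an EPHEMERAL node would not.
    private final Map<String, Integer> persistentEpochs = new ConcurrentHashMap<>();

    /** Called on every leadership grant; atomically increments and returns the job's epoch. */
    int grantLeadership(String jobId) {
        return persistentEpochs.merge(jobId, 1, Integer::sum);
    }

    /** Assumption: an epoch above 1 means an earlier leader already started this job. */
    static boolean startsAfterPreviousLeader(int epoch) {
        return epoch > 1;
    }

    public static void main(String[] args) {
        LeaderEpochSketch election = new LeaderEpochSketch();
        int first = election.grantLeadership("job-1");
        int second = election.grantLeadership("job-1"); // grant after a simulated failure
        System.out.println(startsAfterPreviousLeader(first));  // false
        System.out.println(startsAfterPreviousLeader(second)); // true
    }
}
```

The counter alone does not say whether the earlier leader finished the job, which is one reason a status-based registry reads as the smaller change.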

In summary, I prefer the first way to implement the goal. The whole {{JobManager}} failure
feature is finished on my side. Could I submit the pull request for this issue based
on the {{RunningJobsRegistry}} implementation?


> Determine whether the job starts from last JobManager failure
> -------------------------------------------------------------
>
>                 Key: FLINK-5501
>                 URL: https://issues.apache.org/jira/browse/FLINK-5501
>             Project: Flink
>          Issue Type: Sub-task
>          Components: JobManager
>            Reporter: Zhijiang Wang
>            Assignee: Zhijiang Wang
>
> When the {{JobManagerRunner}} is granted leadership, it should check whether the current
job is already running. If the job is running, the {{JobManager}} should reconcile
itself (enter the RECONCILING state) and wait for the {{TaskManager}} to report task status.
Otherwise the {{JobManager}} can schedule the {{ExecutionGraph}} in the normal way.
> The {{RunningJobsRegistry}} can provide a way to check the job's running status, but
we should extend the current interface and fix the related process to support this function.
> 1. {{RunningJobsRegistry}} sets the RUNNING status when the {{JobManagerRunner}} is granted leadership
for the first time.
> 2. If the job finishes, the job status is set to FINISHED by {{RunningJobsRegistry}},
and the status is deleted before exit.
> 3. If the mini cluster starts multiple {{JobManagerRunner}}s and the leader {{JobManagerRunner}}
has already finished the job and set the job status to FINISHED, the other {{JobManagerRunner}}s will exit
after being granted leadership.
> 4. If the {{JobManager}} fails, the job status remains RUNNING. So when a {{JobManagerRunner}}
(the previous one or a new one) is granted leadership again, it checks the job status and enters
the {{RECONCILING}} state.
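
The four steps in the issue description can be sketched as a small status check. This is only an illustrative in-memory sketch, not Flink's actual {{RunningJobsRegistry}} interface: the class name, the {{Action}} enum, and all method names are hypothetical, and a real implementation would keep the status in highly available storage so that it survives a JobManager failure.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for the registry behaviour described in the issue. */
class JobStatusRegistrySketch {

    /** Status values named in the issue text. */
    enum JobStatus { RUNNING, FINISHED }

    /** What a runner does after it is granted leadership (names are assumptions). */
    enum Action { SCHEDULE, RECONCILE, EXIT }

    // Assumption: a real registry stores this outside the JobManager process.
    private final Map<String, JobStatus> statuses = new ConcurrentHashMap<>();

    /** Steps 1, 3 and 4: decide what to do when leadership is granted. */
    Action onLeadershipGranted(String jobId) {
        // putIfAbsent makes "check and mark RUNNING" a single atomic step.
        JobStatus previous = statuses.putIfAbsent(jobId, JobStatus.RUNNING);
        if (previous == null) {
            return Action.SCHEDULE;   // first grant: schedule the job normally
        }
        return previous == JobStatus.RUNNING
                ? Action.RECONCILE    // an earlier leader failed while the job was running
                : Action.EXIT;        // another runner already finished the job
    }

    /** Step 2, first half: record that the job finished. */
    void markFinished(String jobId) {
        statuses.put(jobId, JobStatus.FINISHED);
    }

    /** Step 2, second half: delete the status before exit. */
    void clear(String jobId) {
        statuses.remove(jobId);
    }

    public static void main(String[] args) {
        JobStatusRegistrySketch registry = new JobStatusRegistrySketch();
        System.out.println(registry.onLeadershipGranted("job-1")); // SCHEDULE
        System.out.println(registry.onLeadershipGranted("job-1")); // RECONCILE
        registry.markFinished("job-1");
        System.out.println(registry.onLeadershipGranted("job-1")); // EXIT
    }
}
```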



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
