reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (REEF-1725) IMRU Job fails when UpdateTask is done but another evaluator fails at the same time causing system state change to ShutDown
Date Fri, 27 Jan 2017 03:04:24 GMT

    [ https://issues.apache.org/jira/browse/REEF-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840745#comment-15840745
] 

Julia edited comment on REEF-1725 at 1/27/17 3:04 AM:
------------------------------------------------------

It is possible to leverage the return value to signal the difference.

The return value of Call() is meant for the client to return its result, so we cannot
override it to mean "update task is done". However, we can use it to represent uncompleted
cases, such as CloseByDriver. We can then check the result on the evaluator side and map this
case to one of the task states, such as KILLED or FAILED.
* If we use the KILLED state, the Java side does not trigger a TaskEvent.
* If we use FAILED, it becomes a FailedTask, which requires task exception information. It
ends up the same as throwing a task exception on the task side in the first place, and that
code path has been tested.
* If we introduce a new task state, we need to add a new TaskEvent and modify the dispatcher,
bridge, etc.

I feel that treating this case as a task exception is the easiest option. It IS an exception
and should not be reported the same way as TaskCompletion. The exception type/message can tell
the driver what caused the exception, and it can be used to distinguish different cases
instead of adding states.

Extending KILLED is another option, but it requires adding one more task event in the
dispatcher and bridge, and it affects more places, including the task state transitions in IMRU.
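
A minimal sketch of the exception-based approach described above, written in Python as a
language-neutral model rather than REEF's actual Java or C# code. Every name in it
(`TaskClosedByDriver`, `TaskOutcome`, `run_task`) is hypothetical. Only the idea comes from
the comment: a normal return maps to a completed task, while a task that was closed by the
driver before finishing raises an exception whose type tells the driver what happened.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class TaskClosedByDriver(Exception):
    """Hypothetical exception: the driver closed the task before it finished."""


@dataclass
class TaskOutcome:
    """Hypothetical record of how a task ended, as seen by the evaluator."""
    state: str                              # "COMPLETED" or "FAILED"
    result: Optional[bytes] = None          # client result, set only when COMPLETED
    error: Optional[BaseException] = None   # cause, set only when FAILED


def run_task(call: Callable[[], bytes]) -> TaskOutcome:
    """Model of the evaluator side: map the end of call() to a task state."""
    try:
        # A normal return always means the task really finished.
        return TaskOutcome(state="COMPLETED", result=call())
    except Exception as exc:
        # Any exception, including a close by the driver, becomes a failed task
        # that carries the exception so the driver can inspect its type.
        return TaskOutcome(state="FAILED", error=exc)


def finished_update_task() -> bytes:
    return b"final model"


def closed_update_task() -> bytes:
    raise TaskClosedByDriver("closed before the update task was done")


done = run_task(finished_update_task)
closed = run_task(closed_update_task)
```

In this model the driver never has to guess from a return value: `done.state` is
"COMPLETED", and `closed.error` is a `TaskClosedByDriver` instance, so no additional task
state is needed to tell the two cases apart.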


was (Author: juliaw):
It is possible to leverage the return value to signal the difference.

The return value of Call() is meant for the client to return its result, so we cannot
override it to mean "update task is done". However, we can use it to represent uncompleted
cases, such as CloseByDriver. We can then check the result on the evaluator side and map this
case to one of the task states, such as KILLED or FAILED.
* If we use the KILLED state, the Java side does not trigger a TaskEvent.
* If we use FAILED, it becomes a FailedTask, which requires task exception information. It
ends up the same as throwing a task exception on the task side in the first place, and that
code path has been tested.
* If we introduce a new task state, we need to add a new TaskEvent and modify the dispatcher,
bridge, etc.

I feel that treating this case as a task exception is the easiest option. It IS an exception
and should not be reported the same way as TaskCompletion. Adding another state may be more
precise, but it is more work.

> IMRU Job fails when UpdateTask is done but another evaluator fails at the same time causing
> system state change to ShutDown
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: REEF-1725
>                 URL: https://issues.apache.org/jira/browse/REEF-1725
>             Project: REEF
>          Issue Type: Bug
>            Reporter: Julia
>            Assignee: Julia
>            Priority: Critical
>
> Currently, in the IMRU fault-tolerant system, if the master task finishes at the same time
> as it receives a Close event caused by other evaluator failures, the system state is
> changed to ShutDown, which triggers a retry.
> In fact, as soon as the master is done, this event should be clearly passed to the driver,
> and the driver should execute DoneAction regardless of any other failures that happen at
> the same time.
> There are multiple possible solutions:
> 1. Let CompletedTask carry “done” information – The major issue with this solution
> is not just the complexity of updating the protocol buffer message and both the Java and C#
> code. The task also needs a way to let TaskRuntime know it is “done”. For that, we would
> need to change the ITask interface, which we should be careful not to change unless it
> is really necessary.
> 2. Use a task message – This is simple to implement. However, a task message is sent with
> the heartbeat for a “running task”. If the task status is changed to closed before the
> heartbeat is sent, the message won’t be sent to the driver.
> 3. Send different events for update task COMPLETE and CLOSE. Currently, whether the update
> task is really done or is closed by the driver, ITask.Call() returns and ICompletedTask is
> sent. If we send ICompletedTask only when the task is really done, no matter what else
> happens, and send IFailedTask if the update task is closed by the driver and is not “done”,
> then the driver will be able to differentiate the two events. This is an easier and quicker
> solution.
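
The driver-side effect of solution 3 can be modelled roughly as follows. This is a sketch
under assumptions, not the IMRU driver: the `Driver` class, its handler names, and the
`SystemState` values are hypothetical stand-ins. It shows only the intended rule that a
completed update task leads to DoneAction even when another evaluator has already failed,
while a failed update task leads to a retry.

```python
from enum import Enum


class SystemState(Enum):
    RUNNING = "Running"
    SHUT_DOWN = "ShutDown"
    DONE = "Done"


class Driver:
    """Hypothetical driver model; not the REEF IMRU driver."""

    def __init__(self):
        self.state = SystemState.RUNNING
        self.actions = []

    def on_evaluator_failed(self):
        # Another evaluator failed: start shutting down so the job can be retried,
        # unless the job has already finished.
        if self.state is not SystemState.DONE:
            self.state = SystemState.SHUT_DOWN

    def on_update_task_completed(self):
        # Completed is sent only when the update task really finished, so the
        # done action runs even if a shutdown is already in progress.
        self.state = SystemState.DONE
        self.actions.append("DoneAction")

    def on_update_task_failed(self):
        # The update task ended without finishing (for example, it was closed
        # by the driver): keep the shutdown state and schedule a retry.
        if self.state is not SystemState.DONE:
            self.state = SystemState.SHUT_DOWN
            self.actions.append("Retry")


# Case from the issue: an evaluator fails just as the update task completes.
finished = Driver()
finished.on_evaluator_failed()
finished.on_update_task_completed()

# Contrast: the update task was closed before it was done.
interrupted = Driver()
interrupted.on_evaluator_failed()
interrupted.on_update_task_failed()
```

With separate events, `finished` ends in the Done state with a single DoneAction, and
`interrupted` stays in ShutDown with a single Retry.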



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
