reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (REEF-1492) On IMRU recovery: if ResultHandler.Dispose() throws exception, IMRU Driver hangs.
Date Fri, 09 Dec 2016 05:28:58 GMT

    [ https://issues.apache.org/jira/browse/REEF-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734115#comment-15734115 ]

Julia edited comment on REEF-1492 at 12/9/16 5:28 AM:
------------------------------------------------------

Currently we call ResultHandler.Dispose() in the finally block of TaskHost.Call(), and the current
implementation of ResultHandler copies the local result file to the remote location in its Dispose()
method. Because exceptions in the task's Call() can happen at any time, and the task close event can
arrive at any stage, there may be no result yet, or the local file may not have been created, by the
time ResultHandler.Dispose() runs. So it is quite likely that Dispose() of ResultHandler will throw.

This kind of exception should be caught by TaskRuntime and eventually sent back to the driver.
However, the call to ResultHandler.Dispose() comes before SignalTaskStopped in TaskHostBase.
So when an exception is thrown in ResultHandler.Dispose(), we miss the call to SignalTaskStopped,
which may cause something to hang.
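
The ordering problem above can be modeled in a few lines. This is only a sketch of the control flow, written in Python rather than the C# of the actual code base; the classes, method names, and the use of threading.Event as a stand-in for SignalTaskStopped are all hypothetical and are not taken from REEF.

```python
import threading


class ResultHandler:
    """Hypothetical stand-in for a result handler whose dispose can fail."""

    def dispose(self):
        # Mimics copying a result file that was never produced.
        raise FileNotFoundError("no result file to copy")


class TaskHost:
    """Hypothetical model: dispose runs before the stop signal."""

    def __init__(self, handler):
        self.handler = handler
        self.task_stopped = threading.Event()

    def call(self):
        try:
            raise RuntimeError("task failed mid-iteration")
        finally:
            self.handler.dispose()    # throws, so the next line never runs
            self.task_stopped.set()   # stand-in for SignalTaskStopped


host = TaskHost(ResultHandler())
caught = None
try:
    host.call()
except Exception as exc:
    caught = exc

print(type(caught).__name__)       # FileNotFoundError
print(host.task_stopped.is_set())  # False
```

Anything waiting on the stop signal in this model would wait forever, which matches the hang described above.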

What I would suggest is:
1. We should not put a lot of logic in the Dispose() method; it should only release resources.
Copying the local result data file to the remote file should be done in the ResultHandler.HandleResult()
method. This method is called only when there is a result. I would assume this method is called only
once, at the end of the iteration. [~dkm2110] please let me know if that is not the case.
2. We should catch exceptions when calling FinallyBlock() in the TaskHost, which calls ResultHandler.Dispose().
If there is no complex logic in the Dispose() method, the chance of failure should be low.
If we really cannot release some resource in the Dispose() method, it should result in a FailedEvaluator.
Since this is the master, there is no recovery.
3. Add another layer of finally around FinallyBlock() to call SignalTaskStopped in TaskHostBase,
to ensure that the task close event handler returns.
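
The three suggestions can be sketched together as follows. This is a Python model of the proposed control flow, not the C# change itself; every class and method name is hypothetical, threading.Event stands in for SignalTaskStopped, and recording the dispose error stands in for reporting a failure to the driver.

```python
import threading


class ResultHandler:
    """Hypothetical handler: copy in handle_result, cleanup only in dispose."""

    def __init__(self):
        self.copied = []
        self.disposed = False

    def handle_result(self, result):
        # Suggestion 1: the copy happens here, only when a result exists.
        self.copied.append(result)

    def dispose(self):
        # Suggestion 1: release resources only.
        self.disposed = True


class FailingHandler(ResultHandler):
    def dispose(self):
        raise OSError("cannot release resource")


class TaskHost:
    def __init__(self, handler, work):
        self.handler = handler
        self.work = work
        self.task_stopped = threading.Event()
        self.dispose_error = None

    def _finally_block(self):
        # Suggestion 2: catch exceptions thrown by dispose.
        try:
            self.handler.dispose()
        except Exception as exc:
            self.dispose_error = exc  # assumption: would be reported as a failure

    def call(self):
        try:
            try:
                self.handler.handle_result(self.work())
            finally:
                self._finally_block()
        finally:
            # Suggestion 3: the outer finally always signals the stop.
            self.task_stopped.set()


def failing_work():
    raise RuntimeError("task failed mid-iteration")


host = TaskHost(FailingHandler(), failing_work)
try:
    host.call()
except RuntimeError:
    pass

print(host.task_stopped.is_set())         # True
print(type(host.dispose_error).__name__)  # OSError
```

In this model the stop signal is set even when both the task and the dispose step fail, and the original task exception still propagates to the caller.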




was (Author: juliaw):
Currently we call ResultHandler.Dispose in the finally block in TaskHost, and the current implementation
of ResultHandler copies the local file to remote in its Dispose() method. As exceptions in the
task can happen at any time, and the close event can come at any stage, in Dispose of ResultHandler
there might be no result yet, or the local file may not have been created, etc. So very possibly an
exception will be thrown.

This exception should be caught by TaskRuntime and eventually sent back to the driver. However,
this call is before SignalTaskStopped in TaskHostBase. So when an exception happens in
ResultHandler.Dispose(), we will miss the call to SignalTaskStopped, which may cause something to hang.

What I would suggest is:
1. Copying the local result data file to remote should be in the ResultHandler.HandleResult() method.
This method is called only when there is a result. I would assume this method is called only once,
at the end of the iteration. [~dkm2110] please let me know if that is not the case. We should
not put a lot of logic in the Dispose() method; it should only release resources.
2. We should catch exceptions when calling FinallyBlock(), which calls Dispose() in the TaskHost.
If there is no complex logic in the Dispose() method, the chance of failure should be low. If
we really cannot release some resource in the Dispose() method, it should result in a FailedEvaluator.
Since this is the master, there is no recovery.
3. Add another layer of finally around FinallyBlock() to call SignalTaskStopped in TaskHostBase,
to ensure that the task close event handler returns.



> On IMRU recovery: if ResultHandler.Dispose() throws exception, IMRU Driver hangs.
> ---------------------------------------------------------------------------------
>
>                 Key: REEF-1492
>                 URL: https://issues.apache.org/jira/browse/REEF-1492
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF
>            Reporter: Andrey
>              Labels: FT
>
> IMRU scenario:
> - one of the map tasks fails
> - Driver triggers shutdown on all tasks
> - UpdateTaskHost on shutdown is calling ResultHandler.Dispose()
> - the result handler (in my case WriteResultHandler) throws an exception because there are no
> results (the Update function was never executed)
> There are a couple of questions here:
> - WriteResultHandler should handle the [no results] situation more gracefully, especially
> on Dispose().
> Probably the file copy logic should be moved from Dispose() to the HandleResult() function.
> - UpdateTaskHost should handle exceptions from the Dispose() call. The result handler can be
> provided by the client, so the code can throw.
> In case of Dispose() failure the UpdateTaskHost should probably trigger a non-recoverable
> failure, which in turn triggers Driver failure (right now the driver hangs).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
