reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1492) On IMRU recovery: if ResultHandler.Dispose() throws exception, IMRU Driver hangs.
Date Fri, 09 Dec 2016 03:01:07 GMT

    [ https://issues.apache.org/jira/browse/REEF-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734115#comment-15734115
] 

Julia commented on REEF-1492:
-----------------------------

Currently we call ResultHandler.Dispose() in the finally block of TaskHost, and the current
TLCPlusPlus implementation of ResultHandler copies the local result file to remote in its
Dispose() method. Because an exception in the task can happen at any time, and a close event
can be sent at any stage, there might be no result yet when Dispose() runs, and the local
file may not have been created. So Dispose() is quite likely to throw an exception.

This exception should be caught by TaskRuntime and eventually sent back to the driver. However,
the Dispose() call comes before SignalTaskStopped in the TaskHost base class. So when an
exception is thrown in ResultHandler.Dispose(), the call to SignalTaskStopped is skipped,
which may cause a hang.
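
To illustrate the ordering problem, here is a minimal Java sketch. It is not the REEF.NET
code: the class MissedSignalSketch and its methods are hypothetical stand-ins that only
mirror the order of calls described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the current TaskHost flow; not the REEF.NET code.
final class MissedSignalSketch {
    // Ordered log of what actually ran, so the skipped call is visible.
    final List<String> events = new ArrayList<>();

    // Plays the role of ResultHandler.Dispose() when no result file exists yet.
    void dispose() {
        events.add("Dispose");
        throw new IllegalStateException("no local result file");
    }

    // Plays the role of SignalTaskStopped.
    void signalTaskStopped() {
        events.add("SignalTaskStopped");
    }

    // Mirrors the problematic ordering: Dispose runs first in the finally block.
    void run() {
        try {
            events.add("TaskBody");
        } finally {
            dispose();            // throws here...
            signalTaskStopped();  // ...so this call is never reached
        }
    }

    public static void main(String[] args) {
        MissedSignalSketch host = new MissedSignalSketch();
        try {
            host.run();
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
        System.out.println(host.events);
    }
}
```

In this sketch the event log never contains SignalTaskStopped, which corresponds to the
missed call described above.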

What I would suggest is:
1. Copying the local result data file to remote should be done in the ResultHandler.HandleResult()
method. This method is called only when there is a result. I assume it is called only once,
at the end of the iteration. [~dkm2110] please let me know if that is not the case. We should
not put a lot of logic in the Dispose() method; it should only release resources.
2. We should catch exceptions when calling FinallyBlock(), which calls Dispose(), in the TaskHost.
If there is no complex logic in Dispose(), the chance of failure should be low. If we really
cannot release some resource in Dispose(), it should result in a FailedEvaluator. Since this
is the master, there is no recovery.
3. Add another layer of finally around FinallyBlock() that calls SignalTaskStopped in TaskHostBase,
to ensure the task close event handler returns.
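
The three suggestions above could be sketched roughly as follows. This is a Java sketch under
assumptions, not the REEF.NET implementation: GuardedTaskHostSketch, the failRelease flag and
the "ReportFailure" event are hypothetical, and the latter is only a placeholder for whatever
mechanism ends up producing a FailedEvaluator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the three suggestions; names are stand-ins, not REEF.NET APIs.
final class GuardedTaskHostSketch {
    // Ordered log of what ran, so the control flow can be inspected.
    final List<String> events = new ArrayList<>();

    // Suggestion 1: the copy to remote happens here, only when a result exists.
    void handleResult(String result) {
        events.add("CopyToRemote:" + result);
    }

    // Dispose only releases resources; failRelease simulates the rare failure case.
    void dispose(boolean failRelease) {
        events.add("Dispose");
        if (failRelease) {
            throw new IllegalStateException("could not release resource");
        }
    }

    // Plays the role of SignalTaskStopped.
    void signalTaskStopped() {
        events.add("SignalTaskStopped");
    }

    void run(String resultOrNull, boolean failRelease) {
        try {
            try {
                if (resultOrNull != null) {
                    handleResult(resultOrNull);
                }
            } finally {
                // Suggestion 2: catch failures coming out of the dispose step.
                try {
                    dispose(failRelease);
                } catch (RuntimeException e) {
                    // Placeholder for reporting a non-recoverable failure.
                    events.add("ReportFailure");
                }
            }
        } finally {
            // Suggestion 3: the outer finally always signals that the task stopped.
            signalTaskStopped();
        }
    }

    public static void main(String[] args) {
        GuardedTaskHostSketch host = new GuardedTaskHostSketch();
        host.run(null, true); // no result yet, and the release fails
        System.out.println(host.events);
    }
}
```

With this shape, a shutdown that arrives before any result exists does not attempt the copy
at all, and a failing dispose step still leaves SignalTaskStopped as the last event.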



> On IMRU recovery: if ResultHandler.Dispose() throws exception, IMRU Driver hangs.
> ---------------------------------------------------------------------------------
>
>                 Key: REEF-1492
>                 URL: https://issues.apache.org/jira/browse/REEF-1492
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF
>            Reporter: Andrey
>              Labels: FT
>
> IMRU scenario:
> - one of the map tasks fails
> - Driver triggers shutdown on all tasks 
> - UpdateTaskHost on shutdown is calling ResultHandler.Dispose()
> - the result handler (in my case WriteResultHandler) throws an exception because there are
> no results (the Update function was never executed)
> There are a couple of questions here:
> - WriteResultHandler should handle the [no results] situation more gracefully, especially
> on Dispose()
> Probably the logic for copying the file should be moved from Dispose() to the HandleResult() function.
> - UpdateTaskHost should handle exceptions from the Dispose() call. The result handler can be
> provided by the client, so the code can throw.
> In case of Dispose() failure the UpdateTaskHost should probably trigger a non-recoverable
> failure, which in turn triggers Driver failure (right now the driver hangs)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
