reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1407) Catching exceptions in group communication in failure case
Date Thu, 26 May 2016 23:36:12 GMT

    [ https://issues.apache.org/jira/browse/REEF-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303167#comment-15303167
] 

Julia commented on REEF-1407:
-----------------------------

This continues the discussion from REEF-1345, which did not converge.


Issue description: when one of the tasks hits a problem and stops running in IMRU, the other
IMRU tasks should throw group communication exceptions instead of hanging.

Code flow: here is the code flow when a group communication error happens:
IMRUTask.Call()->BroadcastReceiver.Receive()->OperatorTopology.ReceiveFromNode()->NodeStruct.GetData()->_messageQueue.Take()

If there is no data, the call blocks.

A solution is to use TryTake() with a timeout instead of Take(). If we cannot get data before
the timeout expires, we throw an exception and propagate it to the IMRU task. In the IMRU task,
if the task has not received a Close event, it can retry taking the data. This way we don't
lose anything in cases such as a slow machine, but we also respect the Close event coming from
the driver.
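
To make the proposal concrete, here is a minimal sketch of the timed take plus retry loop. It is
written in Python rather than C#, so it is only an analogy: queue.Queue.get(timeout=...) stands
in for a TryTake() call with a timeout, and threading.Event stands in for the Close event from
the driver. The names GroupCommunicationError, get_data and receive_with_retry are hypothetical
and are not part of the REEF API.

```python
import queue
import threading


class GroupCommunicationError(Exception):
    """Hypothetical error raised when no message arrives within the timeout."""


def get_data(message_queue, timeout_seconds):
    """Timed take: analog of using TryTake() with a timeout instead of Take()."""
    try:
        return message_queue.get(timeout=timeout_seconds)
    except queue.Empty:
        # Nothing arrived in time: surface this to the caller instead of hanging.
        raise GroupCommunicationError(
            "no data received within %.2f s" % timeout_seconds
        ) from None


def receive_with_retry(message_queue, close_event, timeout_seconds):
    """Task-side loop: retry after a timeout unless the Close event was received."""
    while True:
        try:
            return get_data(message_queue, timeout_seconds)
        except GroupCommunicationError:
            if close_event.is_set():
                # The driver asked the task to close: stop waiting and propagate.
                raise
            # No Close event yet: the sender may just be slow, so try again.
```

With a slow sender, receive_with_retry() keeps polling and eventually returns the message. Once
close_event is set and the queue stays empty, the next timeout raises GroupCommunicationError
to the task instead of blocking.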

Another proposed solution is to catch the underlying network exception. As the network runs on
a different thread, we currently cannot catch network exceptions in the task. There are ways to
catch exceptions from a separate thread in C#. However, to make that work in our case, we need
a very good understanding of the Network/WAKE layers and the thread flow, and it may also
involve many API changes in those libraries. As this is not trivial work, my concern is that it
would block our fault tolerance progress.






> Catching exceptions in group communication in failure case
> ----------------------------------------------------------
>
>                 Key: REEF-1407
>                 URL: https://issues.apache.org/jira/browse/REEF-1407
>             Project: REEF
>          Issue Type: Bug
>            Reporter: Julia
>              Labels: FT
>
> Currently when a task fails, other tasks in the group are stuck in reading data by a
> blocking call. We should be able to try and throw an exception and propagate the exception
> to Task so that the task can handle it in a proper way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
