reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1345) Throw proper exceptions in IMRU Task
Date Tue, 24 May 2016 00:56:13 GMT

    [ https://issues.apache.org/jira/browse/REEF-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297458#comment-15297458 ]

Julia commented on REEF-1345:
-----------------------------

We have a blocking call to read data in group communication. The issue was discussed in REEF-1392.
We need to decide whether we want to throw an exception in this case. There are two
options:

1. Leave it as it is today. In this case, the task would hang if one of the tasks in the communication
group fails. In the fault-tolerant case, we must then rely on the close event handler to force the
task to close so that we can resubmit the task on the same Evaluator.
2. Pass a timeout when calling the message queue's API to Take data. We can make this timeout
pretty long, but at least the task won't hang forever. If we still cannot get data when the timeout
expires, we will throw a TaskGroupCommunication exception.

I prefer the second option; at the least, it would avoid a resource leak if the Evaluator is not killed.
[~markus.weimer] [~dkm2110], let me know what you think.

> Throw proper exceptions in IMRU Task
> ------------------------------------
>
>                 Key: REEF-1345
>                 URL: https://issues.apache.org/jira/browse/REEF-1345
>             Project: REEF
>          Issue Type: Task
>            Reporter: Julia
>              Labels: FT
>
> For IMRU fault tolerance, we need to identify where to throw proper exceptions, with error
> messages, in places where exceptions may happen. This includes:
> TaskFailByCommunication - if there is any error caused by group communication (the typical
> case is when a task is not able to get messages from its children), this exception should be thrown.
> TaskFailedByAppError - catch possible application errors and throw the corresponding exceptions
> in those cases.
> TaskFailedBySystem - any possible system error that could crash the task, such as memory,
> hard disk, file access, network, etc.
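
The three categories in the description could be applied by a wrapper around the task body. Below is a minimal Python sketch; the class names follow the description's categories, but `run_task` and the mapping itself (MemoryError and OSError treated as system failures, every other exception treated as an application error) are assumptions for illustration, not REEF's implementation.

```python
class TaskFailByCommunication(Exception):
    """Group communication failed, e.g. messages from child tasks never arrived."""


class TaskFailedByAppError(Exception):
    """The application (user) code raised an error."""


class TaskFailedBySystem(Exception):
    """A system-level failure: memory, disk, file access, network, etc."""


def run_task(task_body):
    """Hypothetical wrapper: run task_body, re-raising failures as one of three categories."""
    try:
        return task_body()
    except (TaskFailByCommunication, TaskFailedByAppError, TaskFailedBySystem):
        raise  # already classified, e.g. by the group communication layer
    except (MemoryError, OSError) as e:
        # OSError covers file access, disk, and socket/network errors.
        raise TaskFailedBySystem(str(e)) from e
    except Exception as e:
        # Assumption: anything else is a bug or bad input in application code.
        raise TaskFailedByAppError(str(e)) from e


if __name__ == "__main__":
    def bad_update():
        return 1 / 0

    try:
        run_task(bad_update)
    except TaskFailedByAppError as e:
        print("app error:", e)
```

Keeping the original exception as the `__cause__` (via `from e`) preserves the error message and stack trace for whoever decides whether to resubmit the task.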



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
