reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Created] (REEF-1399) Node stuck in group communication failure case
Date Fri, 20 May 2016 19:10:12 GMT
Julia created REEF-1399:
---------------------------

             Summary: Node stuck in group communication failure case
                 Key: REEF-1399
                 URL: https://issues.apache.org/jira/browse/REEF-1399
             Project: REEF
          Issue Type: Bug
            Reporter: Julia


Currently, in group communication, if one of the tasks fails, all the other tasks wait
forever. This can easily cause a leak, because those tasks run in separate threads.
There are two ways to resolve it:
1. Add a timeout to the blocking call in GC. If no message has been received after waiting
long enough, throw a group communication exception.
2. Rely on fault tolerance to have the driver send a close event to those tasks. When a task
is hung and not iterating, force it to close after a timeout by throwing an exception.

We will do the second in any case. The question is: should we also do the first?
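
For discussion, the first option (a timeout on the blocking receive) might look roughly
like the sketch below. It is written in plain Java for illustration only and does not use
REEF's actual group communication classes. The names TimedReceive, receive, and
GroupCommunicationException are hypothetical, and a BlockingQueue stands in for whatever
structure the real receiver blocks on.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class TimedReceive {

    /** Hypothetical exception type for this sketch; not REEF's actual class. */
    static class GroupCommunicationException extends RuntimeException {
        GroupCommunicationException(String message) {
            super(message);
        }
    }

    /**
     * Waits up to the given timeout for a message instead of blocking forever.
     * The inbox queue is an assumption standing in for the real receive buffer.
     */
    static <T> T receive(BlockingQueue<T> inbox, long timeout, TimeUnit unit)
            throws InterruptedException {
        // poll() returns null if nothing arrives before the timeout elapses.
        T message = inbox.poll(timeout, unit);
        if (message == null) {
            throw new GroupCommunicationException(
                "No message received within " + timeout + " " + unit);
        }
        return message;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        inbox.put("hello");

        // A message is available, so this returns immediately.
        System.out.println(receive(inbox, 100, TimeUnit.MILLISECONDS));

        // The queue is now empty, simulating a failed peer that never sends.
        try {
            receive(inbox, 100, TimeUnit.MILLISECONDS);
        } catch (GroupCommunicationException e) {
            System.out.println("timed out: " + e.getMessage());
        }
    }
}
```

With this shape, a task whose peer has failed surfaces an exception after the timeout
rather than holding its thread indefinitely. Choosing the timeout value is left open
here, since a value that is too short could fail tasks that are merely slow.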



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
