reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (REEF-1399) Node stuck in group communication failure case
Date Sat, 16 Jul 2016 00:39:20 GMT

     [ https://issues.apache.org/jira/browse/REEF-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julia resolved REEF-1399.
-------------------------
       Resolution: Fixed
    Fix Version/s: 0.16

Fixed via https://github.com/apache/reef/pull/1052

> Node stuck in group communication failure case
> ----------------------------------------------
>
>                 Key: REEF-1399
>                 URL: https://issues.apache.org/jira/browse/REEF-1399
>             Project: REEF
>          Issue Type: Bug
>            Reporter: Julia
>            Assignee: Julia
>              Labels: FT
>             Fix For: 0.16
>
>
> Currently, in group communication, if one of the tasks fails, all the other tasks
wait forever. This can easily cause a leak, because those tasks run in separate threads.

> There are two ways to resolve it:
> 1. Add a timeout to the blocking call in GC. If no message has been received after
waiting long enough, throw a Group Communication exception.
> 2. Rely on fault tolerance to have the driver send a close event to those tasks. If a
task is not iterating and is hung, force it to close after a timeout by throwing an exception.

> We will do the second in any case. The question is: should we also do the first one?
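
Option 1 in the description (a timeout on the blocking call) could look roughly like the sketch below. It is only an illustration in plain Java and does not use the REEF API: the class `TimedReceive`, the method `receive`, and the exception `GroupCommunicationTimeoutException` are hypothetical names, and a `BlockingQueue` stands in for whatever the group communication layer actually blocks on.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class TimedReceive {
    /** Hypothetical exception; stands in for the "Group Communication exception" in the description. */
    static class GroupCommunicationTimeoutException extends Exception {
        GroupCommunicationTimeoutException(String message) {
            super(message);
        }
    }

    /**
     * Waits up to timeoutMs for the next message. If nothing arrives in time,
     * throws instead of blocking forever, so the calling thread can exit.
     */
    static <T> T receive(BlockingQueue<T> inbox, long timeoutMs)
            throws InterruptedException, GroupCommunicationTimeoutException {
        // poll() returns null when the timeout elapses with no element available.
        T message = inbox.poll(timeoutMs, TimeUnit.MILLISECONDS);
        if (message == null) {
            throw new GroupCommunicationTimeoutException(
                "no message received within " + timeoutMs + " ms");
        }
        return message;
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        inbox.put("partial-result");

        // A message is already queued, so this returns immediately.
        System.out.println(receive(inbox, 100));

        // The queue is now empty, simulating a failed peer that never sends.
        try {
            receive(inbox, 100);
        } catch (GroupCommunicationTimeoutException e) {
            System.out.println("timed out: " + e.getMessage());
        }
    }
}
```

With a bounded wait like this, a task whose peer has failed surfaces an exception rather than holding its thread indefinitely. Choosing the timeout value is a separate question: too short a value would turn a slow but healthy peer into a spurious failure.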



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)