reef-dev mailing list archives

From "Julia (JIRA)" <>
Subject [jira] [Resolved] (REEF-1399) Node stuck in group communication failure case
Date Sat, 16 Jul 2016 00:39:20 GMT


Julia resolved REEF-1399.
       Resolution: Fixed
    Fix Version/s: 0.16

Fixed via

> Node stuck in group communication failure case
> ----------------------------------------------
>                 Key: REEF-1399
>                 URL:
>             Project: REEF
>          Issue Type: Bug
>            Reporter: Julia
>            Assignee: Julia
>              Labels: FT
>             Fix For: 0.16
> Currently, in group communication, if one of the tasks fails, all the other tasks
> wait forever. This can easily cause a leak, because those tasks run in separate threads.

> There are two ways to resolve it:
> 1. Add a timeout to the blocking calls in GC. If no message has been received after
> waiting long enough, throw a Group Communication exception.
> 2. Rely on fault tolerance and let the driver send a close event to those tasks. If a
> task is hung and not iterating, force it to close after a timeout by throwing an exception.

> We will do the second in any case. The question is: should we also do the first one?
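
As a rough illustration of option 1 only, the following is a minimal Java sketch of a blocking receive that gives up after a timeout instead of waiting forever. It is not REEF's actual group communication code: `TimedReceiver`, `deliver`, `receive`, and `GroupCommunicationTimeoutException` are hypothetical names invented for this example, and the in-memory queue merely stands in for the network layer.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical exception type for this sketch; not a REEF class.
class GroupCommunicationTimeoutException extends RuntimeException {
    GroupCommunicationTimeoutException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for a blocking group communication receiver.
class TimedReceiver<T> {
    private final BlockingQueue<T> inbox = new LinkedBlockingQueue<>();
    private final long timeoutMillis;

    TimedReceiver(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called by the (assumed) network layer when a message arrives.
    void deliver(T message) {
        inbox.add(message);
    }

    // Waits at most timeoutMillis for a message, then throws instead of hanging.
    T receive() {
        try {
            T message = inbox.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (message == null) {
                throw new GroupCommunicationTimeoutException(
                    "No message received within " + timeoutMillis + " ms");
            }
            return message;
        } catch (InterruptedException e) {
            // Restore the interrupt flag so the owning thread can shut down.
            Thread.currentThread().interrupt();
            throw new GroupCommunicationTimeoutException("Interrupted while waiting");
        }
    }
}
```

With a receiver like this, a task whose peer has failed would surface an exception after the timeout rather than leaving its thread blocked. Choosing the timeout is the hard part: too short and slow but healthy peers are treated as failed.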
