reef-dev mailing list archives

From "Markus Weimer (JIRA)" <>
Subject [jira] [Commented] (REEF-1399) Node stuck in group communication failure case
Date Mon, 23 May 2016 21:57:12 GMT


Markus Weimer commented on REEF-1399:

I'm with Dhruv on this. I am equally ignorant of network programming, but real-time
units have a way of messing with us in the long run, when machines are unusually slow or fast
compared to our expectations. Time-outs on the worker nodes also undermine the REEF promise
of a centralized control flow. Hence, I strongly favor putting the Driver in charge of shutting
down Tasks that need to be shut down.
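
To illustrate what Driver-controlled shutdown means for a blocked Task, here is a minimal Java sketch. It is not REEF code: `DriverClose`, `BlockedTask` and `close()` are hypothetical names, and a plain method call stands in for the close event the Driver would send. The Task waits with no local timer, and only the close call unblocks it.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class DriverClose {
    /** Hypothetical stand-in for a Task blocked in a group communication receive. */
    static class BlockedTask implements Runnable {
        final BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        volatile String outcome = "running";
        private final Thread thread = new Thread(this);

        void start() {
            thread.start();
        }

        @Override
        public void run() {
            try {
                inbox.take(); // blocks indefinitely; the Task keeps no timer of its own
                outcome = "received";
            } catch (InterruptedException e) {
                outcome = "closed by driver";
            }
        }

        /** Assumed to be invoked when the Driver's close event arrives. */
        void close() throws InterruptedException {
            thread.interrupt(); // unblocks take() with InterruptedException
            thread.join();      // wait until the Task thread has exited
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockedTask task = new BlockedTask();
        task.start();
        task.close(); // the Driver, not the worker, decides when to give up
        System.out.println(task.outcome);
    }
}
```

The decision about when to call `close()` stays in one place, so a slow or fast machine cannot cause workers to disagree about who has timed out.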

> Node stuck in group communication failure case
> ----------------------------------------------
>                 Key: REEF-1399
>                 URL:
>             Project: REEF
>          Issue Type: Bug
>            Reporter: Julia
>              Labels: FT
> Currently, in group communication, if one of the tasks fails, all the other tasks
> wait forever. That could easily cause a leak, as those tasks are running in separate threads.

> There are two ways to resolve it:
> 1. Add a timeout to the blocking call in GC. After waiting long enough without
> receiving any message, throw a Group Communication exception.
> 2. Depend on fault tolerance to let the driver send a close event to those tasks. When a
> task is not iterating and is hung, after a timeout, force the task to close by throwing an exception.

> We will do the second in any case. The question is: shall we also do the first one?
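
For reference, option 1 from the issue could look roughly like the Java sketch below. It is only an illustration under assumptions: `TimedReceive`, `receive` and `GroupCommTimeoutException` are hypothetical names, not REEF classes, and a `BlockingQueue` stands in for the group communication channel.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class TimedReceive {
    /** Hypothetical exception type; not a REEF class. */
    static class GroupCommTimeoutException extends RuntimeException {
        GroupCommTimeoutException(String msg) {
            super(msg);
        }
    }

    /** Waits up to timeoutMs for a message instead of blocking forever. */
    static String receive(BlockingQueue<String> inbox, long timeoutMs)
            throws InterruptedException {
        // poll() returns null if nothing arrives before the timeout elapses
        String msg = inbox.poll(timeoutMs, TimeUnit.MILLISECONDS);
        if (msg == null) {
            throw new GroupCommTimeoutException(
                "no message within " + timeoutMs + " ms");
        }
        return msg;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        inbox.put("partial-sum");
        System.out.println(receive(inbox, 100)); // message is already queued
        try {
            receive(inbox, 100);                 // queue is empty, so this times out
        } catch (GroupCommTimeoutException e) {
            System.out.println("timed out: " + e.getMessage());
        }
    }
}
```

The timeout value is the weak point: each worker applies it against its own wall clock, which is the concern about real-time units raised in the comment above.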

This message was sent by Atlassian JIRA
