reef-dev mailing list archives

From "Dhruv Mahajan (JIRA)" <>
Subject [jira] [Commented] (REEF-1399) Node stuck in group communication failure case
Date Fri, 20 May 2016 20:28:13 GMT


Dhruv Mahajan commented on REEF-1399:

Timeouts like these are always tricky, and I am not in favor of them (although I am no
networking expert, so I cannot say that strongly). Can we use cancellation tokens for this
purpose instead? Whenever a new connection is requested or created, we also pass the
cancellation token from the upstream user application (GC) to the corresponding StreamingLink
invocation. Once we catch an exception in StreamingLink or the streams, we cancel the token.
This way the upstream process knows that something bad happened and can throw an exception.
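
To make the idea concrete, here is a rough sketch in Python rather than in REEF's own code. All
names in it (CancellationToken, Link, deliver, on_network_failure, GroupCommunicationError) are
made up for illustration and are not the REEF or .NET APIs. It only shows the shape of the
proposal: the link cancels a shared token when it sees a failure, and a receiver blocked
upstream wakes up and throws instead of waiting forever.

```python
import queue
import threading


class GroupCommunicationError(Exception):
    """Raised upstream when the link reports a failure (hypothetical name)."""


class CancellationToken:
    """Minimal stand-in for a cancellation token (assumption, not the .NET type)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cancelled = False
        self._callbacks = []

    @property
    def is_cancelled(self):
        with self._lock:
            return self._cancelled

    def register(self, callback):
        # Run the callback on cancellation, or right away if already cancelled.
        with self._lock:
            if not self._cancelled:
                self._callbacks.append(callback)
                return
        callback()

    def cancel(self):
        # Flip the flag once, then notify everyone outside the lock.
        with self._lock:
            if self._cancelled:
                return
            self._cancelled = True
            callbacks, self._callbacks = self._callbacks, []
        for callback in callbacks:
            callback()


_CANCELLED = object()  # sentinel that wakes a blocked receiver


class Link:
    """Hypothetical link that receives the token from the upstream caller."""

    def __init__(self, token):
        self._token = token
        self._inbox = queue.Queue()
        token.register(lambda: self._inbox.put(_CANCELLED))

    def deliver(self, message):
        self._inbox.put(message)

    def on_network_failure(self):
        # Called from the place where the link would catch a stream exception.
        self._token.cancel()

    def receive(self):
        # Blocks with no timeout; cancellation is what unblocks it.
        item = self._inbox.get()
        if item is _CANCELLED:
            self._inbox.put(_CANCELLED)  # keep later receivers from blocking
            raise GroupCommunicationError("link cancelled after a failure")
        return item
```

In this sketch a thread blocked in receive() is released as soon as on_network_failure() runs,
so no waiting period has to be chosen. Messages already queued before the failure are still
returned first.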

Also, I do not see this happening on YARN. In case of network failures, the containers were
always given back.

> Node stuck in group communication failure case
> ----------------------------------------------
>                 Key: REEF-1399
>                 URL:
>             Project: REEF
>          Issue Type: Bug
>            Reporter: Julia
>              Labels: FT
> Currently, in group communication, if one of the tasks fails, all the other tasks wait
> forever. That could easily cause a leak, as those tasks are running in separate threads.

> There are two ways to resolve it:
> 1. Add a timeout to the blocking call in GC. After waiting long enough without receiving
> any message, throw a Group Communication exception.
> 2. Depend on fault tolerance to let the driver send a close event to those tasks. When a
> task is not iterating and is hung, after a timeout, force the task to close by throwing an
> exception.

> We will do the second in any case. The question is: shall we do the first one?

This message was sent by Atlassian JIRA
