reef-dev mailing list archives

From "Julia (JIRA)" <>
Subject [jira] [Commented] (REEF-1399) Node stuck in group communication failure case
Date Fri, 20 May 2016 21:49:12 GMT


Julia commented on REEF-1399:

A CancellationToken is essentially a way to implement a timeout. The Wake layer has its own
CancellationToken, and I don't think we want to make big changes in Wake and GC at this moment
unless we have a very good understanding of what we are doing.
CancellationToken is usually used in async calls. BlockingCollection has an API that lets the
caller specify a timeout, so technically this is not a problem.
Since the read at the Wake layer already has a timeout (it uses a CancellationToken), the GC
layer should also set a timeout on its reads instead of waiting forever.
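
To make the idea concrete, here is a minimal sketch of a blocking read that gives up after a
timeout. It is written in Python only to keep it short, and it is not REEF code: the names
receive and GroupCommunicationTimeout are made up for illustration. In .NET the corresponding
call would be a timed BlockingCollection<T>.TryTake.

```python
import queue
import threading


class GroupCommunicationTimeout(Exception):
    """Hypothetical error raised when no message arrives in time."""


def receive(inbox: queue.Queue, timeout_s: float):
    """Block for at most timeout_s seconds waiting for one message."""
    try:
        return inbox.get(timeout=timeout_s)
    except queue.Empty:
        # Turn the library's "nothing arrived" signal into a domain error.
        raise GroupCommunicationTimeout(
            f"no message received within {timeout_s} s"
        ) from None


if __name__ == "__main__":
    inbox = queue.Queue()
    # A healthy peer delivers a message shortly after we start waiting.
    threading.Timer(0.05, inbox.put, args=("hello",)).start()
    print(receive(inbox, timeout_s=1.0))
    # A failed peer never sends; the read gives up instead of hanging.
    try:
        receive(inbox, timeout_s=0.1)
    except GroupCommunicationTimeout as err:
        print("timed out:", err)
```

The caller decides what to do with the exception; the point is only that the receiving thread
gets control back instead of blocking forever.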
In the current IMRU driver, in FailedEvaluator, we don't have code to dispose the rest of the
active contexts. For a failed task we don't dispose them either; we throw an exception instead.
In my local tests I can clearly see that in the failure case, even after the driver has stopped
and the test has completed, the evaluators and tasks keep running forever. Maybe the RM cleans
up the resources in a YARN environment.

> Node stuck in group communication failure case
> ----------------------------------------------
>                 Key: REEF-1399
>                 URL:
>             Project: REEF
>          Issue Type: Bug
>            Reporter: Julia
>              Labels: FT
> Currently, in group communication, if one of the tasks fails, all the other tasks wait
> forever. That could easily cause a leak, as those tasks are running in separate threads.

> There are two ways to resolve it:
> 1. Add a timeout to the blocking call in GC. After waiting long enough without receiving
> any message, throw a group communication exception.
> 2. Depend on fault tolerance and let the driver send a close event to those tasks. When a
> task is not iterating and is hung, after a timeout, force the task to close by throwing an exception.

> We will do the second in any case. The question is: shall we do the first one?
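
A minimal sketch of the second approach quoted above, assuming the task can poll for a close
signal while it waits. It is written in Python for brevity and is not REEF code: TaskClosedError,
receive_until_closed, run_task and close_hung_task are hypothetical names, and a threading.Event
stands in for the driver's close event.

```python
import queue
import threading


class TaskClosedError(Exception):
    """Hypothetical error used to force a hung task out of its wait."""


def receive_until_closed(inbox, close_event, poll_s=0.05):
    """Wait for a message, but give up as soon as close_event is set."""
    while not close_event.is_set():
        try:
            return inbox.get(timeout=poll_s)
        except queue.Empty:
            continue  # nothing yet; re-check the close signal
    raise TaskClosedError("close event received while waiting for a message")


def run_task(inbox, close_event, result):
    """Task body: record how the wait ended so the driver can inspect it."""
    try:
        result.append(("message", receive_until_closed(inbox, close_event)))
    except TaskClosedError:
        result.append(("closed", None))


def close_hung_task(close_event, worker, grace_s):
    """Driver side: signal close, then wait up to grace_s for the task to exit."""
    close_event.set()
    worker.join(timeout=grace_s)
    return not worker.is_alive()


if __name__ == "__main__":
    inbox, close_event, result = queue.Queue(), threading.Event(), []
    worker = threading.Thread(target=run_task, args=(inbox, close_event, result))
    worker.start()  # the peer has failed, so no message ever arrives
    print(close_hung_task(close_event, worker, grace_s=1.0), result)
```

This only works if the blocked read wakes up periodically to check the signal. A read that
blocks with no timeout at all could not be interrupted this way.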

This message was sent by Atlassian JIRA
