Mailing-List: contact dev-help@reef.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@reef.apache.org
Date: Mon, 23 May 2016 21:57:12 +0000 (UTC)
From: "Markus Weimer (JIRA)" <jira@apache.org>
To: dev@reef.apache.org
Message-ID: <JIRA.12971625.1463771392000.270435.1464040632874@Atlassian.JIRA>
In-Reply-To: <JIRA.12971625.1463771392000@Atlassian.JIRA>
References: <JIRA.12971625.1463771392000@Atlassian.JIRA> <JIRA.12971625.1463771392522@arcas>
Subject: [jira] [Commented] (REEF-1399) Node stuck in group communication
 failure case
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Mon, 23 May 2016 21:57:19 -0000


    [ https://issues.apache.org/jira/browse/REEF-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297173#comment-15297173 ] 

Markus Weimer commented on REEF-1399:
-------------------------------------

I'm with Dhruv on this. Also, I am equally ignorant of network programming. But real time units have a way of messing with us in the long run, when machines are unusually slow or fast compared to our expectations. Time-outs on the worker nodes also mess with the REEF promise of a centralized control flow. Hence, I strongly favor putting the Driver in charge of shutting down Tasks that need to be shut down.
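
A minimal sketch of what Driver-initiated shutdown could look like on the Task side, in case it helps the discussion. It is not REEF's actual group communication code: the class DriverClosableReceiver, the methods onMessage / onDriverClose / receive, and TaskClosedException are all hypothetical names. The point is only that the blocking call carries no wall-clock timeout; it ends when a message arrives or when the Driver's close event is delivered.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch; names do not correspond to REEF's real API. */
class DriverClosableReceiver {

  /** Thrown when the Driver asks the Task to stop while it is blocked. */
  static final class TaskClosedException extends Exception {
    TaskClosedException(final String message) {
      super(message);
    }
  }

  // Sentinel placed on the queue by the close handler to wake a blocked receiver.
  private static final Object CLOSE_SIGNAL = new Object();

  private final BlockingQueue<Object> inbox = new LinkedBlockingQueue<>();

  /** Assumed to be called by the network layer when a peer message arrives. */
  void onMessage(final String message) {
    inbox.add(message);
  }

  /** Assumed to be called when the Driver's close event reaches this Task. */
  void onDriverClose() {
    inbox.add(CLOSE_SIGNAL);
  }

  /** Blocks without a timeout; only a message or a Driver close ends the wait. */
  String receive() throws InterruptedException, TaskClosedException {
    final Object next = inbox.take();
    if (next == CLOSE_SIGNAL) {
      throw new TaskClosedException("Driver requested close while waiting for a message");
    }
    return (String) next;
  }
}
```

With this shape, the decision about when a Task has waited too long stays with the Driver, which is the only place that needs a notion of time.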

> Node stuck in group communication failure case
> ----------------------------------------------
>
>                 Key: REEF-1399
>                 URL: https://issues.apache.org/jira/browse/REEF-1399
>             Project: REEF
>          Issue Type: Bug
>            Reporter: Julia
>              Labels: FT
>
> Currently, in group communication, if one of the tasks fails, all the other tasks wait forever. This can easily cause a leak, since those tasks run in separate threads.
> There are two ways to resolve it:
> 1. Add a timeout to the blocking call in GC. After waiting long enough without receiving any message, throw a Group Communication exception.
> 2. Rely on fault tolerance to have the driver send a close event to those tasks. When a task is not iterating and is hung, after a timeout, force the task to close by throwing an exception.
> We will do the second in any case. The question is: shall we also do the first one?
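
For comparison, option 1 from the description above could be sketched roughly as follows. This is an illustration only, not the REEF implementation: TimedReceiver, onMessage, receive, and the nested GroupCommunicationException are hypothetical stand-ins, and the timeout value is an arbitrary parameter chosen by the caller.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a blocking receive with a timeout (option 1). */
class TimedReceiver {

  /** Stand-in for the "Group Communication exception" mentioned in the issue. */
  static final class GroupCommunicationException extends Exception {
    GroupCommunicationException(final String message) {
      super(message);
    }
  }

  private final BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
  private final long timeoutMillis;

  TimedReceiver(final long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  /** Assumed to be called by the network layer when a peer message arrives. */
  void onMessage(final String message) {
    inbox.add(message);
  }

  /** Waits at most timeoutMillis for a message, then gives up with an exception. */
  String receive() throws InterruptedException, GroupCommunicationException {
    final String next = inbox.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    if (next == null) {
      throw new GroupCommunicationException(
          "No message received within " + timeoutMillis + " ms");
    }
    return next;
  }
}
```

The weakness discussed in this thread is visible in the constructor argument: whatever value is chosen for timeoutMillis is a guess about machine and network speed, and each worker makes that call locally rather than the Driver.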

