Date: Wed, 31 Aug 2016 17:57:20 +0000 (UTC)
From: "Dhruv Mahajan (JIRA)"
To: dev@reef.apache.org
Reply-To: dev@reef.apache.org
Subject: [jira] [Comment Edited] (REEF-1251) IMRU Driver handlers for Fault Tolerant

    [ https://issues.apache.org/jira/browse/REEF-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452860#comment-15452860 ]

Dhruv Mahajan edited comment on REEF-1251 at 8/31/16 5:57 PM:
--------------------------------------------------------------

[~juliaw]
[~MariiaMykhailova] [~markus.weimer] [~andreym] Looking at the current PR, it seems it could be divided into multiple separate PRs. However, I see the concern that testing these as separate JIRAs would become a daunting task. Hence, it might be OK to merge it as a one-off big PR. Still, I would like to lay down the concerns and shortcomings so that we are on the same page that these need to be addressed to make IMRU FT practically usable.
# Currently, if an evaluator fails while we are still in the task submission phase, the newly created tasks will wait for a long time in {{WaitForRegistration}} during Group Communication initialization before getting cancelled. There are two ways this can be handled:
#* The proper, let's-get-it-over-with way would be to let the driver handle this registration and synchronization mechanism. Having it in a central place will solve lots of issues for us. The {{GroupCommDriver}} service would need to be extended, and an additional event handler would need to be bound, so that when the driver receives the messages from the contexts that the GroupComm service is ready, it can start the tasks. This also means having an additional context later in IMRU FT.
#* A less optimal and less preferable way would be to pass {{WaitForRegistration}} some sort of cancellation token or boolean variable that it can check after every retry to see whether it needs to exit. The task, on receiving the close signal, can then simply set this boolean to true. Will this work? One question: if we are still in the constructor of the task, i.e. the driver has not yet received an {{IRunningTask}}, is there a way to send the close signal and act on it? I guess not. If not, then I would seriously suggest working on the first option above.
# The whole exception handling in {{*TaskHost}} looks very convoluted to me. It seems the handlers were added after a lot of testing on the cluster, in response to the exceptions we encountered.
What if we encounter a new sort of exception? I understand this is a trickier problem in general, and I propose to simplify it. There can be multiple sources of exceptions or failures: a) a bug in base REEF, b) a bug in IMRU, c) a bug in the user's map and update tasks, and d) a bug in group communication because the codecs provided by the user are buggy or the user forgot to provide one. Can't we have a simple logic where all failures are recoverable? a) and b) should not happen in any case, since those are REEF bugs and we should not run IMRU FT while they exist. For c) and d), the responsibility lies with the user, and it's OK to do weird things there. In fact, Hadoop MapReduce also does that. There can be another issue where the cluster itself is doing weird things beyond a)-d), although I cannot think what. In that case we can't do much anyway.
Thoughts?
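To make the second option in item 1 concrete, here is a minimal sketch of a retry loop that checks a cancellation flag after every attempt. It is written in Python for brevity, not in REEF.NET's C#, and it is not the actual {{WaitForRegistration}} implementation: the names wait_for_registration, RegistrationCancelled and is_registered are hypothetical and exist only for illustration.

```python
import threading


class RegistrationCancelled(Exception):
    """Raised when the wait is abandoned because a close signal arrived."""


def wait_for_registration(is_registered, cancel, max_retries=10, retry_interval=0.5):
    """Poll is_registered() until it succeeds, retries run out, or cancel is set.

    is_registered: hypothetical zero-argument callable, True once all peers registered.
    cancel: threading.Event that the task's close handler sets on a close signal.
    Returns True once registered, False when the retries are exhausted.
    """
    for _ in range(max_retries):
        # Check the flag before each attempt so a pending close is honoured at once.
        if cancel.is_set():
            raise RegistrationCancelled("close signal received before retry")
        if is_registered():
            return True
        # Event.wait is an interruptible sleep: it returns True as soon as
        # cancel is set, instead of blocking for the full retry interval.
        if cancel.wait(retry_interval):
            raise RegistrationCancelled("close signal received while waiting")
    return False
```

In this sketch the close handler would only need to call cancel.set(). It does not address the open question above: if the close signal cannot be delivered while the task is still in its constructor, nothing ever sets the flag, and the wait runs until the retries are exhausted.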
> IMRU Driver handlers for Fault Tolerant
> ---------------------------------------
>
> Key: REEF-1251
> URL: https://issues.apache.org/jira/browse/REEF-1251
> Project: REEF
> Issue Type: Task
> Components: REEF.NET, REEF.NET Evaluator
> Reporter: Julia
> Assignee: Julia
> Labels: FT
>
> Handles communication between the driver and the evaluators for evaluator and task recovery when some evaluators fail. The following describes an example flow:
> Here is the control flow in the normal scenario:
> a. All the task, context and task status information is maintained in Task Manager when tasks are created for the first time
> b. Task1, Task2 and Task3 are queued in Task Starter
> c. When all tasks in a group are ready, the tasks are submitted
> d. When tasks start running, task status is updated in Task Manager
> e. Evaluator 3 fails
> f. Driver receives the failed evaluator event and reports it to Evaluator Manager
> g. Task Manager updates the task status to mark task3 as failed
> h. Driver sends a message to task1 and task2 to stop them and updates the task status in Task Manager
> i. Driver requests a new evaluator3’ for the failed evaluator, submits a new context3’ for it and adds a new task3’ to the queue
> j. Driver recreates task1’ and task2’ with the existing context1 and context2 and adds them to the queue
> k. When all the new tasks in the communication group are ready, start tasks as in step c.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)