Mailing-List: contact dev-help@reef.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@reef.apache.org
Date: Wed, 31 Aug 2016 17:57:20 +0000 (UTC)
From: "Dhruv Mahajan (JIRA)" <jira@apache.org>
To: dev@reef.apache.org
Message-ID: <JIRA.12949314.1457731216000.460322.1472666240781@Atlassian.JIRA>
In-Reply-To: <JIRA.12949314.1457731216000@Atlassian.JIRA>
References: <JIRA.12949314.1457731216000@Atlassian.JIRA> <JIRA.12949314.1457731216255@arcas>
Subject: [jira] [Comment Edited] (REEF-1251) IMRU Driver handlers for Fault
 Tolerant
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Wed, 31 Aug 2016 17:57:38 -0000


    [ https://issues.apache.org/jira/browse/REEF-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452860#comment-15452860 ]

Dhruv Mahajan edited comment on REEF-1251 at 8/31/16 5:57 PM:
--------------------------------------------------------------

[~juliaw] [~MariiaMykhailova] [~markus.weimer] [~andreym] Looking at the current PR, it seems it could be divided into multiple separate PRs. However, I see the concern that testing would become a daunting task if these were split into separate JIRAs. Hence, it might be OK to merge it as a one-off big PR. Still, I would like to lay out some concerns and shortcomings, and make sure we are on the same page that these need to be addressed to make IMRU FT practically usable.

# Currently, if an evaluator fails while we are still in the task submission phase, the newly created tasks will wait for a long time in {{WaitForRegistration}} during Group Communication initialization before getting cancelled. There are two ways this can be handled:
#* The proper, let's-get-it-over-with way would be to let the driver handle this registration and synchronization mechanism. Having it in a central place will solve lots of issues for us. The {{GroupCommDriver}} service would need to be extended, and an additional event handler would need to be bound, so that the driver can start the tasks once it has received the messages from the contexts saying that the GroupComm service is ready. This also means having an additional context later in IMRU FT.
#* A less optimal and less preferable way would be to pass {{WaitForRegistration}} some sort of cancellation token or boolean flag that it checks after every retry to see whether it needs to exit. On receiving the close signal, the task can then simply set this flag to true. Will this work? One question: if we are still in the constructor of the task, i.e. the driver has not yet received an {{IRunningTask}}, is there a way to send the close signal and act on it? I guess not. If not, then I would seriously suggest working on the first option above.
# The whole exception handling in {{*TaskHost}} looks very convoluted to me. It seems the handlers were added after testing a lot on the cluster and observing the exceptions we encountered. What if we encounter a new sort of exception? I understand this is a tricky problem in general, and I propose to simplify it. There can be multiple sources of exceptions or failures: a) a bug in base REEF, b) a bug in IMRU, c) a bug in the user's map and update tasks, and d) a bug in group communication because the codecs provided by the user are buggy or the user forgot to provide one. Can't we have simple logic where all failures are recoverable? a) and b) should not happen in any case, since those are REEF bugs and we should not run IMRU FT while they exist. For c) and d) the responsibility lies with the user, and it's OK to do weird things there. In fact, Hadoop MapReduce also does that. There can be another issue where the cluster itself is doing weird things beyond a)-d), although I cannot think what. In that case we can't do much anyway.
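
To make the second option under point 1 concrete, here is a minimal, language-neutral sketch in Python. It is not REEF.NET code: the function name, its parameters and the returned strings are all hypothetical, and the real {{WaitForRegistration}} may be structured quite differently. It only shows a retry loop that checks a cancellation flag before and between retries. It does not answer the open question above, namely whether a close signal can reach a task that is still in its constructor.

```python
import threading


def wait_for_registration(is_registered, cancel, retries=5, interval=0.01):
    """Hypothetical stand-in for a registration wait loop (not the REEF API).

    is_registered: callable returning True once all peers have registered.
    cancel: threading.Event that a close handler sets to abort the wait.
    Returns "registered", "cancelled" or "timed_out".
    """
    for _ in range(retries):
        if cancel.is_set():
            return "cancelled"
        if is_registered():
            return "registered"
        # Event.wait works as an interruptible sleep: it returns True
        # as soon as the flag is set, so cancellation is seen promptly.
        if cancel.wait(interval):
            return "cancelled"
    return "timed_out"
```

In this sketch the task's close handler would simply call `cancel.set()`, which mirrors "set this boolean to true" in the text.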

Thoughts?


> IMRU Driver handlers for Fault Tolerant
> ---------------------------------------
>
>                 Key: REEF-1251
>                 URL: https://issues.apache.org/jira/browse/REEF-1251
>             Project: REEF
>          Issue Type: Task
>          Components: REEF.NET, REEF.NET Evaluator
>            Reporter: Julia
>            Assignee: Julia
>              Labels: FT
>
> Handles communication between the driver and evaluators for evaluator and task recovery when some evaluators fail. The following describes an example flow:
> Here is the control flow, starting from the normal scenario:
> a. All the task, context and task status information is maintained in Task Manager when tasks are created for the first time
> b. Task1, Task2 and Task3 are queued in Task Starter
> c. When all tasks in a group are ready, the tasks are submitted
> d. When tasks start running, task status is updated in Task Manager
> e. Evaluator 3 fails
> f. Driver receives the failed evaluator event and reports it to Evaluator Manager
> g. Task Manager updates the task status to set task3 as failed
> h. Driver sends a message to task1 and task2 to stop them and updates task status in Task Manager
> i. Driver requests a new evaluator3’ for the failed evaluator, submits a new context3’ for it and adds a new task3’ to the queue
> j. Driver recreates task1’ and task2’ with the existing context1 and context2 and adds them to the queue
> k. When all the new tasks in the communication group are ready, start tasks as in step c.
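
The flow in the issue description above can be sketched as a toy state model. This is only an illustration in Python under stated assumptions: the class `TaskGroup`, its methods and the status strings are hypothetical and do not correspond to the actual Task Manager or Task Starter classes in REEF.NET. It shows the bookkeeping only: queue tasks, submit when the whole group is ready, and on an evaluator failure mark the failed task, stop the survivors and requeue replacements.

```python
class TaskGroup:
    """Toy model of the Task Manager / Task Starter bookkeeping (hypothetical)."""

    def __init__(self, expected):
        self.expected = expected  # number of tasks in the communication group
        self.queued = {}          # task id -> context id, waiting for submission
        self.status = {}          # task id -> queued | running | failed | stopped

    def queue(self, task, context):
        """Steps b, i, j: queue a task; submit once the group is complete (c, k)."""
        self.queued[task] = context
        self.status[task] = "queued"
        if len(self.queued) < self.expected:
            return False
        for t in self.queued:
            self.status[t] = "running"  # step d
        self.queued.clear()
        return True

    def evaluator_failed(self, failed_task):
        """Steps e-h: mark the failed task and stop the surviving tasks."""
        self.status[failed_task] = "failed"
        survivors = [t for t, s in self.status.items() if s == "running"]
        for t in survivors:
            self.status[t] = "stopped"
        return survivors


def run_example():
    contexts = {"task1": "context1", "task2": "context2", "task3": "context3"}
    group = TaskGroup(expected=3)
    for task, ctx in contexts.items():
        group.queue(task, ctx)
    survivors = group.evaluator_failed("task3")  # evaluator 3 fails
    group.queue("task3'", "context3'")           # new evaluator and context
    for task in survivors:                       # reuse existing contexts
        group.queue(task + "'", contexts[task])
    return group.status
```

Calling `run_example()` walks through steps a to k once: the three original tasks end up stopped or failed, and the three primed replacements are submitted together only after the last of them has been queued.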


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)