reef-dev mailing list archives

From "Dhruv Mahajan (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (REEF-1251) IMRU Driver handlers for Fault Tolerant
Date Wed, 31 Aug 2016 17:57:20 GMT

    [ https://issues.apache.org/jira/browse/REEF-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452860#comment-15452860 ]

Dhruv Mahajan edited comment on REEF-1251 at 8/31/16 5:57 PM:
--------------------------------------------------------------

[~juliaw] [~MariiaMykhailova] [~markus.weimer] [~andreym] Looking at the current PR, it seems
it could be divided into multiple separate PRs. However, I see the concern that splitting these
into separate JIRAs would make testing a daunting task. Hence, it might be ok to merge it as
one big PR. However, I would like to lay down concerns and shortcomings so that we are on the
same page that these need to be addressed to make IMRU FT practically usable.

# Currently, if an evaluator fails while we are still in the task submission phase, we will
have an issue where the newly created tasks wait for a long time in {{WaitForRegistration}}
in Group Communication initialization before getting cancelled. Two ways this can be handled:
#* A proper, let's-get-it-over-with way would be to let the driver handle this registration
and synchronization mechanism. Having it in a central place will solve lots of issues for us.
The {{GroupCommDriver}} service would need to be extended, and an additional event handler would
need to be bound so that when the driver receives messages from the contexts that the Group
Communication service is ready, it can start the tasks. This also means having an additional
context layer in IMRU FT.
#* A less optimal and less preferable way would be to pass {{WaitForRegistration}} some sort
of cancellation token or boolean variable that it can check after every retry to see whether
it needs to bail out. The task, on receiving the close signal, can then simply set this boolean
to true. Will this work? One question: if we are in the constructor of the task, i.e. the driver
does not yet have an {{IRunningTask}}, is there a way to send the close signal and act on it?
I guess not. If not, then I would seriously suggest working on the first option above.
# The whole exception handling in {{*TaskHost}} looks very convoluted to me. It seems it was
put together after a lot of testing on the cluster, based on the exceptions we observed. What
if we encounter a new sort of exception? I understand this is a trickier problem in general,
and I propose to simplify it. There can be multiple sources of exceptions or failures: a) a
bug in base REEF, b) a bug in IMRU, c) a bug in the user's map and update tasks, and d) a bug
in group communication because the codecs provided by the user are buggy or the user forgot
to provide one. Can't we have simple logic where all failures are recoverable? a) and b) should
not happen in any case, since those are REEF bugs and we should not run IMRU FT while they
exist. For c) and d) the responsibility lies with the user, and it's ok if weird things happen
there. In fact, Hadoop MapReduce also does that. There could be another issue where the cluster
itself is doing weird things beyond a)-d), although I cannot think of what. In that case we
can't do much anyway.
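To make the cancellation-token idea in 1 concrete, here is a rough Java sketch of a registration wait loop that checks a shared flag between retries. The actual {{WaitForRegistration}} lives in REEF.NET; all names here ({{RegistrationWaiter}}, {{cancel}}, {{waitForRegistration}}) are illustrative, not real REEF APIs:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: the wait loop polls a cancellation flag between
// retries; the task's close handler flips the flag to unblock it.
class RegistrationWaiter {
  private final AtomicBoolean cancelled = new AtomicBoolean(false);
  private final int maxRetries;
  private final long retrySleepMs;

  RegistrationWaiter(final int maxRetries, final long retrySleepMs) {
    this.maxRetries = maxRetries;
    this.retrySleepMs = retrySleepMs;
  }

  /** Called from the task's close handler when the driver signals shutdown. */
  void cancel() {
    cancelled.set(true);
  }

  /**
   * Returns true once registration succeeds, false if cancelled first;
   * throws if all retries are exhausted.
   */
  boolean waitForRegistration(final BooleanSupplier isRegistered) {
    for (int i = 0; i < maxRetries; i++) {
      if (cancelled.get()) {
        return false;                 // close signal observed: bail out early
      }
      if (isRegistered.getAsBoolean()) {
        return true;                  // all group communication peers registered
      }
      try {
        Thread.sleep(retrySleepMs);   // back off before the next retry
      } catch (final InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;                 // treat interruption like cancellation
      }
    }
    throw new IllegalStateException("registration timed out after " + maxRetries + " retries");
  }
}
```

Note this only helps once the flag can actually be set, which is exactly the open question about the task constructor above.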

Thoughts?






> IMRU Driver handlers for Fault Tolerant
> ---------------------------------------
>
>                 Key: REEF-1251
>                 URL: https://issues.apache.org/jira/browse/REEF-1251
>             Project: REEF
>          Issue Type: Task
>          Components: REEF.NET, REEF.NET Evaluator
>            Reporter: Julia
>            Assignee: Julia
>              Labels: FT
>
> Handles communication between the driver and evaluators for evaluator and task recovery when some evaluators fail. The following describes the flow for an example:
> Here is the control flow in the normal scenario:
> a.	All task, context, and task status information is maintained in the Task Manager when the tasks are created for the first time
> b.	Task1, Task2, and Task3 are queued in the Task Starter
> c.	When all tasks in a group are ready, the tasks are submitted
> d.	When the tasks start running, task status is updated in the Task Manager
> e.	Evaluator3 fails
> f.	The driver receives the failed evaluator event and reports it to the Evaluator Manager
> g.	The Task Manager updates task status to set Task3 as failed
> h.	The driver sends a message to Task1 and Task2 to stop them and updates task status in the Task Manager
> i.	The driver requests a new Evaluator3’ for the failed evaluator, submits a new Context3’ for it, and adds a new Task3’ to the queue
> j.	The driver recreates Task1’ and Task2’ with the existing Context1 and Context2 and adds them to the queue
> k.	When all the new tasks in the communication group are ready, start the tasks as in step c.
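The recovery flow in steps e)-k) above can be sketched as a small state tracker (a hedged Java illustration; {{TaskManager}}, {{onFailedEvaluator}}, and {{submitWhenReady}} are names assumed for this sketch, not the actual REEF-1251 implementation):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

enum TaskState { QUEUED, RUNNING, FAILED, STOPPED }

// Hypothetical sketch of the Task Manager bookkeeping: on a failed
// evaluator, mark the lost task failed, stop the survivors, queue
// replacement tasks, and submit the whole group only once it is complete.
class TaskManager {
  private final Map<String, TaskState> states = new LinkedHashMap<>();
  private final Deque<String> resubmitQueue = new ArrayDeque<>();

  void taskRunning(final String taskId) {          // step d)
    states.put(taskId, TaskState.RUNNING);
  }

  /** Steps f)-j): handle a failed evaluator that hosted failedTaskId. */
  void onFailedEvaluator(final String failedTaskId) {
    states.put(failedTaskId, TaskState.FAILED);    // g) mark the lost task failed
    for (final Map.Entry<String, TaskState> e : states.entrySet()) {
      if (e.getValue() == TaskState.RUNNING) {
        e.setValue(TaskState.STOPPED);             // h) stop the surviving tasks
      }
    }
    for (final String id : states.keySet()) {
      resubmitQueue.add(id + "'");                 // i)+j) queue replacement tasks
    }
  }

  /** Step k): submit the group only when every replacement is ready. */
  List<String> submitWhenReady(final int groupSize) {
    if (resubmitQueue.size() < groupSize) {
      return Collections.emptyList();              // group incomplete: keep waiting
    }
    final List<String> group = new ArrayList<>(resubmitQueue);
    resubmitQueue.clear();
    group.forEach(id -> states.put(id, TaskState.RUNNING));
    return group;
  }
}
```

The design point this illustrates is that submission is gated on the whole communication group being ready, mirroring step c) of the normal flow.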



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
