reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1251) Driver handlers in Evaluator recovery
Date Tue, 12 Apr 2016 01:05:25 GMT

    [ https://issues.apache.org/jira/browse/REEF-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236380#comment-15236380
] 

Julia commented on REEF-1251:
-----------------------------

The following shows the details in each driver handler. 
Driver Start Handler
•	Start Action
Start Action
•	Reset NumberOfFailedEvaluator and numberOfAppErrors
•	Change state to WAITING_FOR_EVALUATORS
•	Based on the count in the evaluator list, request (total - count) evaluators
Allocated Evaluator Handler
Case WAITING_FOR_EVALUATORS
•	Add Evaluator to the Evaluator List
•	Submit Context&Service with data load
o	If master context is not in the ActiveContext list, submit master context
o	Else submit slave context
Case FAIL
	Do nothing
Active Context Handler
Case WAITING_FOR_EVALUATORS
•	Add Active Context to the ActiveContext List
•	If the ActiveContext list reaches the total number, take Submit Tasks Action
Case FAIL:
•	Close the ActiveContext
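
The Allocated Evaluator and Active Context handlers above can be sketched roughly as follows. This is only an illustration of the described cases, not the REEF.NET code: the `Driver` class, the string state names, the `MASTER` context id and the `slave-` id scheme are all hypothetical, and states the text does not list are simply ignored.

```python
WAITING = "WAITING_FOR_EVALUATORS"
SUBMITTING = "SUBMITTING_TASKS"
FAIL = "FAIL"
MASTER = "MasterContext"  # hypothetical id for the master context


class Driver:
    def __init__(self, total):
        self.state = WAITING
        self.total = total            # number of evaluators/contexts required
        self.evaluators = []          # Evaluator List
        self.active_contexts = []     # ActiveContext List
        self.closed_contexts = []     # stands in for calling close on a context

    def on_allocated_evaluator(self, evaluator_id):
        """Return the id of the context to submit, or None if nothing is done."""
        if self.state != WAITING:
            return None               # Case FAIL: do nothing
        self.evaluators.append(evaluator_id)
        if MASTER not in self.active_contexts:
            return MASTER             # no active master context yet
        return f"slave-{evaluator_id}"

    def on_active_context(self, context_id):
        if self.state == WAITING:
            self.active_contexts.append(context_id)
            if len(self.active_contexts) == self.total:
                # Submit Tasks Action starts by changing the state.
                self.state = SUBMITTING
        elif self.state == FAIL:
            self.closed_contexts.append(context_id)
```
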
Submit Tasks Action
•	Change state to SUBMITTING_TASKS
•	Create new Communication group and task starter
•	For each context, add task to Task list and set status to TASK_NEW
•	Make sure one master task, rest are slave tasks
•	When the Task list reaches the total number, call SubmitTask() with the master first, followed
by the slaves. For each submitted task, change the task state to TASK_SUBMITTING.
•	Before each SubmitTask() call, check the system state. If it is not SUBMITTING_TASKS,
stop submitting and change the remaining task states from TASK_NEW to TASK_CLOSED_BY_DRIVER.
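
The submission loop described above might look like the sketch below. The function name `submit_tasks`, the dict-based task records, and the `get_system_state` and `submit` callbacks are assumptions made for illustration; only the state names come from the text.

```python
from enum import Enum, auto


class SystemState(Enum):
    WAITING_FOR_EVALUATORS = auto()
    SUBMITTING_TASKS = auto()
    TASKS_RUNNING = auto()
    SHUTTING_DOWN = auto()
    FAIL = auto()


class TaskState(Enum):
    TASK_NEW = auto()
    TASK_SUBMITTING = auto()
    TASK_CLOSED_BY_DRIVER = auto()


def submit_tasks(tasks, get_system_state, submit):
    """Submit the master first, then the slaves.

    tasks: list of dicts with keys 'id', 'is_master' and 'state'.
    Returns True if every task was submitted, False if submission stopped early.
    """
    # sorted() is stable, so slaves keep their relative order after the master.
    ordered = sorted(tasks, key=lambda t: not t["is_master"])
    for i, task in enumerate(ordered):
        # Re-check the system state before every single submission.
        if get_system_state() is not SystemState.SUBMITTING_TASKS:
            for rest in ordered[i:]:
                if rest["state"] is TaskState.TASK_NEW:
                    rest["state"] = TaskState.TASK_CLOSED_BY_DRIVER
            return False
        submit(task)
        task["state"] = TaskState.TASK_SUBMITTING
    return True
```

Checking the state inside the loop, rather than once up front, is what lets a failure event that arrives mid-submission stop the remaining tasks from being started.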
Running Task Handler
Case SUBMITTING_TASKS
•	Change the task status to TASK_RUNNING
•	When all the tasks are running, change system state to TASKS_RUNNING
Case SHUTTING_DOWN/FAIL
•	Close the task itself and change the task state to TASK_WAITING_FOR_CLOSE (it will be closed
and the event will be received in FailedTaskHandler)
Case others
•	Log warning
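
As a rough sketch of the Running Task Handler cases above: the function signature, the `task_states` dict and the `close_task` callback are hypothetical, and the handler returns the new system state instead of mutating a driver object.

```python
import logging
from enum import Enum, auto


class SystemState(Enum):
    WAITING_FOR_EVALUATORS = auto()
    SUBMITTING_TASKS = auto()
    TASKS_RUNNING = auto()
    SHUTTING_DOWN = auto()
    FAIL = auto()


class TaskState(Enum):
    TASK_SUBMITTING = auto()
    TASK_RUNNING = auto()
    TASK_WAITING_FOR_CLOSE = auto()


def on_running_task(system_state, task_states, task_id, close_task):
    """Handle a running-task event and return the (possibly new) system state.

    task_states maps task id -> TaskState and is updated in place.
    """
    if system_state is SystemState.SUBMITTING_TASKS:
        task_states[task_id] = TaskState.TASK_RUNNING
        if all(s is TaskState.TASK_RUNNING for s in task_states.values()):
            return SystemState.TASKS_RUNNING
    elif system_state in (SystemState.SHUTTING_DOWN, SystemState.FAIL):
        # The task started too late; ask it to close and wait for the
        # resulting event to arrive in the failed-task handler.
        close_task(task_id)
        task_states[task_id] = TaskState.TASK_WAITING_FOR_CLOSE
    else:
        logging.warning("Unexpected running task %s in state %s",
                        task_id, system_state.name)
    return system_state
```
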

Failed Evaluator Handler
Case WAITING_FOR_EVALUATORS
•	Remove the evaluator and context from the Evaluator and Active Context lists (if there
is no context attached to the failed evaluator, meaning context is not submitted, only remove
the failed evaluator)
•	Submit an evaluator
Case SUBMITTING_TASKS/TASKS_RUNNING (first time we got error)
•	Change state to SHUTTING_DOWN
•	NumberOfFailedEvaluator++
•	Remove the evaluator and context from the Evaluator and Active Context lists
•	If there is an associated task, change the task status to TASK_FAIL_BY_EVALUATOR 
•	Close all running tasks in the Task list
•	If all the tasks are in final states 
      if RecoveryCondition == true, Start Action
      else change system state to FAIL, take FAIL action
Case SHUTTING_DOWN
•	NumberOfFailedEvaluator++
•	Remove the evaluator and context from the Evaluator and Active Context lists
•	If there is an associated task, change the failed task status to TASK_FAIL_BY_EVALUATOR
•	If all the tasks are in final states 
      if RecoveryCondition == true, Start Action
      else change system state to FAIL, take FAIL action
Case FAIL:
•	Log info and return
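
The Failed Evaluator Handler cases can be sketched as below. Several details are assumptions, not statements from the text: the set of "final" task states, marking closed running tasks as TASK_WAITING_FOR_CLOSE (inferred from the Running Task Handler), keying tasks by context id, and every class, attribute and method name.

```python
from enum import Enum, auto


class Sys(Enum):
    WAITING_FOR_EVALUATORS = auto()
    SUBMITTING_TASKS = auto()
    TASKS_RUNNING = auto()
    SHUTTING_DOWN = auto()
    FAIL = auto()


class Task(Enum):
    TASK_RUNNING = auto()
    TASK_WAITING_FOR_CLOSE = auto()
    TASK_FAIL_BY_EVALUATOR = auto()
    TASK_CLOSED_BY_DRIVER = auto()


# Assumption: the text does not enumerate which task states are final.
FINAL = {Task.TASK_FAIL_BY_EVALUATOR, Task.TASK_CLOSED_BY_DRIVER}


class Driver:
    def __init__(self, can_recover):
        self.state = Sys.TASKS_RUNNING
        self.evaluators = {}          # evaluator id -> context id (or None)
        self.active_contexts = set()
        self.tasks = {}               # context id -> Task state
        self.failed_evaluators = 0
        self.requested = 0            # stands in for evaluator requests
        self.can_recover = can_recover  # hypothetical RecoveryCondition callback

    def start_action(self):
        self.failed_evaluators = 0
        self.state = Sys.WAITING_FOR_EVALUATORS

    def fail_action(self):
        self.state = Sys.FAIL
        self.active_contexts.clear()  # stands in for closing every context

    def on_failed_evaluator(self, evaluator_id):
        if self.state is Sys.FAIL:
            return                    # log info and return
        context_id = self.evaluators.pop(evaluator_id, None)
        self.active_contexts.discard(context_id)
        if self.state is Sys.WAITING_FOR_EVALUATORS:
            self.requested += 1       # request a replacement evaluator
            return
        first_error = self.state in (Sys.SUBMITTING_TASKS, Sys.TASKS_RUNNING)
        if first_error:
            self.state = Sys.SHUTTING_DOWN
        self.failed_evaluators += 1
        if context_id in self.tasks:
            self.tasks[context_id] = Task.TASK_FAIL_BY_EVALUATOR
        if first_error:
            for ctx, st in list(self.tasks.items()):
                if st is Task.TASK_RUNNING:
                    self.tasks[ctx] = Task.TASK_WAITING_FOR_CLOSE
        if all(st in FINAL for st in self.tasks.values()):
            if self.can_recover():
                self.start_action()
            else:
                self.fail_action()
```
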

Failed Context Handler
Case WAITING_FOR_EVALUATORS
•	Remove the evaluator and context from the Evaluator and Active Context lists
•	If KeepWaitingForEvaluators, Submit an evaluator
•	Else FAIL, take FAIL action

Failed Task Handler
Case SUBMITTING_TASKS/ TASKS_RUNNING
•	Change state to SHUTTING_DOWN
•	Update task state based on the error message
•	Close all running tasks in Task list
•	If all the tasks are in final states 
      if RecoveryCondition == true, Start Action
      else change system state to FAIL, take FAIL action
Case SHUTTING_DOWN
•	If the task state is not TASK_FAIL_BY_EVALUATOR, update task state based on the error
message 
•	If the task state is TASK_WAITING_FOR_CLOSE, update the task status as TASK_CLOSED_BY_DRIVER
•	If all the tasks are in final states 
      if RecoveryCondition == true, Start Action
      else change system state to FAIL, take FAIL action
Case FAIL
•	Set task state
•	Close Active Context
Case WAITING_FOR_EVALUATORS
•	Log a warning and do nothing else
Action FAIL
•	Close all active contexts

Recovery Conditions
Recovery is determined by multiple things:
•	numberOfFailedEvaluator – This number is reset when entering the WAITING_FOR_EVALUATORS state.
During each recovery, it is increased by 1 for each failed evaluator. Once it reaches a
threshold, we will not recover. 
•	numberOfTry – Each time we enter the WAITING_FOR_EVALUATORS state, this number is increased
by 1. Once it reaches a threshold, we will not recover. 
•	numberOfAppErrors – When an app error is received, this number is increased by 1. We will
not recover from an app error. 
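
A minimal sketch of how these counters could combine into a single recovery check. The class name, method names and threshold values are hypothetical; the comment above does not give concrete thresholds.

```python
from dataclasses import dataclass


@dataclass
class RecoveryCounters:
    # Hypothetical thresholds, chosen only for illustration.
    max_failed_evaluators: int = 3
    max_tries: int = 2
    number_of_failed_evaluators: int = 0
    number_of_tries: int = 0
    number_of_app_errors: int = 0

    def enter_waiting_for_evaluators(self):
        # Start Action resets the per-attempt counters; each entry is one try.
        self.number_of_failed_evaluators = 0
        self.number_of_app_errors = 0
        self.number_of_tries += 1

    def on_failed_evaluator(self):
        self.number_of_failed_evaluators += 1

    def on_app_error(self):
        self.number_of_app_errors += 1

    def can_recover(self):
        # "Reaches a threshold" is read here as >= threshold.
        return (self.number_of_app_errors == 0
                and self.number_of_failed_evaluators < self.max_failed_evaluators
                and self.number_of_tries < self.max_tries)
```

With this reading, a single app error blocks recovery outright, while evaluator failures and retries are tolerated until their respective limits are hit.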


> Driver handlers in Evaluator recovery 
> --------------------------------------
>
>                 Key: REEF-1251
>                 URL: https://issues.apache.org/jira/browse/REEF-1251
>             Project: REEF
>          Issue Type: Task
>          Components: REEF.NET, REEF.NET Evaluator
>            Reporter: Julia
>            Assignee: Julia
>              Labels: FT
>
> Handles communication between the driver and evaluators for evaluator and task recovery
when some evaluators fail. The following describes the flow for an example:
> Here is the control flow in the normal scenario:
> a.	All the task, context and task status information is maintained in Task Manager when
tasks are created for the first time
> b.	Task1, Task2 and Task3 are queued in Task Starter 
> c.	When all tasks in a group are ready, the tasks are submitted
> d.	When tasks start running, task status is updated in Task Manager
> e.	Evaluator 3 fails 
> f.	Driver receives the failed evaluator event and reports it to Evaluator Manager
> g.	Task Manager updates the task status to set task3 as failed
> h.	Driver sends a message to task1 and task2 to stop them and updates the task status in Task
Manager
> i.	Driver requests a new evaluator3’ for the failed evaluator, submits a new context3’
for it and adds a new task3’ to the queue
> j.	Driver recreates task1’ and task2’ with the existing context1 and context2 and adds them
to the queue
> k.	When all the new tasks in the communication group are ready, start the tasks as in step
c.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
