reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (REEF-1223) IMRU Fault Tolerance - restart failed evaluators
Date Wed, 30 Mar 2016 02:37:25 GMT

    [ https://issues.apache.org/jira/browse/REEF-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217269#comment-15217269 ]

Julia edited comment on REEF-1223 at 3/30/16 2:36 AM:
------------------------------------------------------

The following describes what each driver handler and action does:

**Driver Start Handler**
* Start Action

**Start Action**
* Reset NumberOfFailedEvaluator and numberOfAppErrors
* Change state to WAITING_FOR_EVALUATORS
* Request (total - count) evaluators, where count is the number of evaluators already in the Evaluator list
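
A minimal Python sketch of the Start Action, to illustrate the "total - count" request. The `Driver` class and its field names are hypothetical stand-ins for illustration, not the REEF.NET implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Driver:
    # Hypothetical stand-in for the IMRU driver; all names are assumptions.
    total_evaluators: int
    evaluators: list = field(default_factory=list)  # evaluators still alive
    state: str = ""
    number_of_failed_evaluators: int = 0
    number_of_app_errors: int = 0

    def start_action(self) -> int:
        """Reset the counters, enter WAITING_FOR_EVALUATORS, and return
        how many evaluators would have to be requested."""
        self.number_of_failed_evaluators = 0
        self.number_of_app_errors = 0
        self.state = "WAITING_FOR_EVALUATORS"
        # Only the missing evaluators are requested; survivors are reused.
        return self.total_evaluators - len(self.evaluators)
```

On a restart with three surviving evaluators out of four, `start_action()` would return 1, so only the failed evaluator is replaced.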

**Allocated Evaluator Handler**
Case WAITING_FOR_EVALUATORS
* Add Evaluator to the Evaluator List
* Submit Context&Service with data load
- If master context is not in the ActiveContext list, submit master context
- Else submit slave context
Case FAIL
* Do nothing

**Active Context Handler**
Case WAITING_FOR_EVALUATORS
* Add Active Context to the ActiveContext List
* If the number of active contexts reaches the total number, take the Submit Tasks Action
Case FAIL:
* Close the ActiveContext

**Submit Tasks Action**
* Change state to SUBMITTING_TASKS
* Create new Communication group and task starter
* For each context, add task to Task list and set status to TASK_NEW
* Make sure there is exactly one master task; the rest are slave tasks
* When the number of tasks reaches the total number, call SubmitTask() with the master first,
followed by the slaves. For each submitted task, change the task state to TASK_SUBMITTING.
* Before each SubmitTask() call, check the system state. If it is not SUBMITTING_TASKS, stop
submitting and change the remaining task states from TASK_NEW to TASK_CLOSED_BY_DRIVER.
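
The submission loop above could be sketched roughly as follows in Python. The function name, the `get_system_state` and `submit` callbacks, and the use of plain strings for states are assumptions made for the example only:

```python
TASK_NEW = "TASK_NEW"
TASK_SUBMITTING = "TASK_SUBMITTING"
TASK_CLOSED_BY_DRIVER = "TASK_CLOSED_BY_DRIVER"


def submit_tasks(task_ids, master_id, get_system_state, submit):
    """Submit the master task first, then the slaves.

    get_system_state and submit are hypothetical callbacks standing in
    for the driver's state lookup and its SubmitTask() call.
    Returns a dict mapping task id to its resulting task state.
    """
    states = {tid: TASK_NEW for tid in task_ids}
    ordered = [master_id] + [t for t in task_ids if t != master_id]
    for tid in ordered:
        # Re-check the system state before every submission: a failure
        # may have moved the driver out of SUBMITTING_TASKS meanwhile.
        if get_system_state() != "SUBMITTING_TASKS":
            for rest in list(states):
                if states[rest] == TASK_NEW:
                    states[rest] = TASK_CLOSED_BY_DRIVER
            break
        submit(tid)
        states[tid] = TASK_SUBMITTING
    return states
```

Checking the state inside the loop, rather than once up front, is what lets a failure that arrives mid-submission stop the remaining tasks from ever being submitted.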

**Running Task Handler**
Case SUBMITTING_TASKS
* Change the task status to TASK_RUNNING
* When all the tasks are running, change system state to TASKS_RUNNING
Case SHUTTING_DOWN/FAIL
* Close the task and change its state to TASK_WAITING_FOR_CLOSE (once it is closed, the
event will be received in FailedTaskHandler)
Case others
* Log warning

**Failed Evaluator Handler**
Case WAITING_FOR_EVALUATORS
* Remove the evaluator and context from the Evaluator and Active Context lists
* Request a new evaluator

Case SUBMITTING_TASKS/TASKS_RUNNING (first error received)
* Change system state to SHUTTING_DOWN
* NumberOfFailedEvaluator++
* Remove the evaluator and context from the Evaluator and Active Context lists
* If there is an associated task, change the task state to TASK_FAIL_BY_EVALUATOR
* Close all running tasks in the Task list
* If all the tasks are in final states
      if RecoveryCondition == true, take Start Action
      else change system state to FAIL, take FAIL action

Case SHUTTING_DOWN
* NumberOfFailedEvaluator++
* Remove the evaluator and context from the Evaluator and Active Context lists
* If there is an associated task, change the failed task status to TASK_FAIL_BY_EVALUATOR
* If all the tasks are in final states
      if RecoveryCondition == true, take Start Action
      else change system state to FAIL, take FAIL action

Case FAIL:
* Log info and return
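
To make the four cases concrete, here is a rough Python sketch of the Failed Evaluator Handler. Everything in it is illustrative: the `Driver` fields, the returned action names, and the `can_recover` flag (a stand-in for RecoveryCondition) are hypothetical. It also assumes that TASK_FAIL_BY_EVALUATOR and TASK_CLOSED_BY_DRIVER count as final states and that closing a running task moves it to TASK_WAITING_FOR_CLOSE, neither of which is spelled out above:

```python
from dataclasses import dataclass, field

# Assumption: the set of "final" task states is not listed in the design.
FINAL_STATES = {"TASK_FAIL_BY_EVALUATOR", "TASK_CLOSED_BY_DRIVER"}


@dataclass
class Driver:
    state: str
    evaluators: set = field(default_factory=set)
    contexts: dict = field(default_factory=dict)  # evaluator id -> context id
    task_of: dict = field(default_factory=dict)   # evaluator id -> task id
    tasks: dict = field(default_factory=dict)     # task id -> task state
    failed_evaluators: int = 0
    can_recover: bool = True                      # stand-in for RecoveryCondition


def on_failed_evaluator(d, evaluator_id):
    """Apply one failed-evaluator event; return the follow-up action name."""
    if d.state == "FAIL":
        return "log"  # log info and return, nothing else changes
    d.evaluators.discard(evaluator_id)
    d.contexts.pop(evaluator_id, None)
    if d.state == "WAITING_FOR_EVALUATORS":
        return "request_evaluator"
    first_error = d.state in ("SUBMITTING_TASKS", "TASKS_RUNNING")
    if first_error:
        d.state = "SHUTTING_DOWN"
    d.failed_evaluators += 1
    task_id = d.task_of.get(evaluator_id)
    if task_id is not None:
        d.tasks[task_id] = "TASK_FAIL_BY_EVALUATOR"
    if first_error:
        # Close every task that is still running (assumed resulting state).
        for tid, st in list(d.tasks.items()):
            if st == "TASK_RUNNING":
                d.tasks[tid] = "TASK_WAITING_FOR_CLOSE"
    if all(st in FINAL_STATES for st in d.tasks.values()):
        if d.can_recover:
            return "start_action"
        d.state = "FAIL"
        return "fail_action"
    return "wait"  # some tasks have not reached a final state yet
```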

**Failed Context Handler**
Case WAITING_FOR_EVALUATORS
* Remove the evaluator and context from the Evaluator and Active Context lists
* If KeepWaitingForEvaluators, request a new evaluator
* Else change system state to FAIL, take FAIL action

**Failed Task Handler**
Case SUBMITTING_TASKS/TASKS_RUNNING
* Change system state to SHUTTING_DOWN
* Update task state based on the error message
* Close all running tasks in the Task list
* If all the tasks are in final states
      if RecoveryCondition == true, take Start Action
      else change system state to FAIL, take FAIL action

Case SHUTTING_DOWN
* If the task state is not TASK_FAIL_BY_EVALUATOR, update task state based on the error message
* If the task state is TASK_WAITING_FOR_CLOSE, update the task state to TASK_CLOSED_BY_DRIVER
* If all the tasks are in final states
      if RecoveryCondition == true, take Start Action
      else change system state to FAIL, take FAIL action

Case FAIL
* Set task state
* Close Active Context
Case WAITING_FOR_EVALUATORS
* Log a warning and do nothing else
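
The state updates in the Failed Task Handler can be summarized as a small pure function. This is one possible reading, not the actual implementation: it assumes TASK_WAITING_FOR_CLOSE takes precedence over the error-message rule in SHUTTING_DOWN, and that "Set task state" in the FAIL case means the state derived from the error. The `state_from_error` argument and the TASK_FAIL_BY_APP value used in the usage note are hypothetical:

```python
def on_failed_task(system_state, task_state, state_from_error):
    """Return (new_system_state, new_task_state) for one failed-task event.

    state_from_error is a hypothetical placeholder for whatever task
    state the driver derives from the error message.
    """
    if system_state in ("SUBMITTING_TASKS", "TASKS_RUNNING"):
        # First error: start shutting down and record why the task failed.
        return "SHUTTING_DOWN", state_from_error
    if system_state == "SHUTTING_DOWN":
        if task_state == "TASK_WAITING_FOR_CLOSE":
            # The driver asked for this close, so it is not a real failure.
            return system_state, "TASK_CLOSED_BY_DRIVER"
        if task_state == "TASK_FAIL_BY_EVALUATOR":
            # Already attributed to an evaluator failure; keep that state.
            return system_state, task_state
        return system_state, state_from_error
    if system_state == "FAIL":
        return system_state, state_from_error
    # WAITING_FOR_EVALUATORS: log a warning and change nothing.
    return system_state, task_state
```

For example, `on_failed_task("SHUTTING_DOWN", "TASK_WAITING_FOR_CLOSE", "TASK_FAIL_BY_APP")` yields `("SHUTTING_DOWN", "TASK_CLOSED_BY_DRIVER")` under this reading.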

**Action FAIL**
* Close all active contexts

**Recovery Conditions**
Recovery is determined by several counters:
* numberOfFailedEvaluator – This number is reset when entering the WAITING_FOR_EVALUATORS state.
During each recovery, it is increased by 1 for each failed evaluator. Once it reaches a
threshold, we will not recover.
* numberOfTry – Each time we enter the WAITING_FOR_EVALUATORS state, this number is increased
by 1. Once it reaches a threshold, we will not recover.
* numberOfAppErrors – When an app error is received, this number is increased by 1. We will
not recover from an app error.
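
Put together, the recovery condition could look roughly like the Python predicate below. The function name and the two threshold parameters are assumptions for illustration; the design does not name them or give their values:

```python
def can_recover(number_of_failed_evaluators, number_of_tries,
                number_of_app_errors,
                max_failed_evaluators, max_tries):
    """Return True if the driver may take the Start Action again.

    max_failed_evaluators and max_tries are hypothetical threshold
    parameters; recovery stops once a counter reaches its threshold.
    """
    if number_of_app_errors > 0:
        return False  # application errors are never retried
    if number_of_failed_evaluators >= max_failed_evaluators:
        return False
    if number_of_tries >= max_tries:
        return False
    return True
```

Because numberOfFailedEvaluator is reset on every restart while numberOfTry is not, the first threshold bounds failures within one attempt and the second bounds the number of attempts overall.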



> IMRU Fault Tolerance - restart failed evaluators
> ------------------------------------------------
>
>                 Key: REEF-1223
>                 URL: https://issues.apache.org/jira/browse/REEF-1223
>             Project: REEF
>          Issue Type: New Feature
>          Components: IMRU, REEF.NET
>            Reporter: Julia
>            Assignee: Julia
>         Attachments: REEF Fault Tolerant Technical design.docx
>
>
> Currently, in the .NET Group Communication and IMRU scenario, if one of the Evaluators fails
> for whatever reason, all the Evaluators are killed by the driver.
> There are multiple levels of fault tolerance. The scenario we would like to support in
> this JIRA is:
> *  When an Evaluator fails, it is killed and the other good Evaluators stay, but all the
> tasks running on those Evaluators are stopped.
> *  A new Evaluator is requested and started with the original task.
> *  The same tasks are resubmitted to the rest of the Evaluators.
> *  The topology of those tasks is kept in the same group communication as before.
> *  The data that has been downloaded in those good Evaluators stays.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
