reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (REEF-1248) Identify the scenarios that need to restart evaluators
Date Tue, 07 Jun 2016 07:35:20 GMT

     [ https://issues.apache.org/jira/browse/REEF-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julia resolved REEF-1248.
-------------------------
    Resolution: Fixed

We will handle a failed evaluator as long as it is not the update evaluator.
The maximum number of failed evaluators is configurable.
Details should be covered in the design doc.
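
The policy in this resolution could be sketched roughly as follows. This is only an illustration, not the REEF implementation: the class, enum, and method names are hypothetical, and two behaviours are assumptions the comment does not state (that a failure of the update evaluator fails the job, and that the configured maximum is an inclusive limit).

```java
// Hypothetical sketch; none of these names come from the REEF API.
final class EvaluatorFailurePolicy {
    // UPDATE stands for the update evaluator named in the resolution;
    // OTHER is a placeholder for every other evaluator.
    enum Role { UPDATE, OTHER }

    enum Decision { RESTART, FAIL_JOB }

    private final int maxFailedEvaluators; // the configurable maximum
    private int failedSoFar = 0;

    EvaluatorFailurePolicy(int maxFailedEvaluators) {
        if (maxFailedEvaluators < 0) {
            throw new IllegalArgumentException("maxFailedEvaluators must be >= 0");
        }
        this.maxFailedEvaluators = maxFailedEvaluators;
    }

    // Decide what to do when an evaluator with the given role fails.
    Decision onEvaluatorFailed(Role role) {
        if (role == Role.UPDATE) {
            // Assumption: losing the update evaluator is not handled.
            return Decision.FAIL_JOB;
        }
        failedSoFar++;
        // Assumption: the limit is inclusive, so exactly
        // maxFailedEvaluators failures are still tolerated.
        return failedSoFar <= maxFailedEvaluators
                ? Decision.RESTART
                : Decision.FAIL_JOB;
    }
}
```

With a maximum of 2, the first two failures of other evaluators would yield RESTART and the third FAIL_JOB; a failure of the update evaluator yields FAIL_JOB regardless of the count.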

> Identify the scenarios that need to restart evaluators
> ------------------------------------------------------
>
>                 Key: REEF-1248
>                 URL: https://issues.apache.org/jira/browse/REEF-1248
>             Project: REEF
>          Issue Type: Task
>          Components: REEF.NET
>            Reporter: Julia
>              Labels: FT
>
> a.	Any transient app error should have retry logic inside the code. If it still fails after retrying, restarting the server won’t help.
> b.	Any expected app exceptions should not be recoverable
> c.	Unexpected app exceptions should not be recoverable
> Resource issue
> a.	Evaluator is killed by the RM. We should respond to this case
> System Error
> a.	System issue causing a machine crash
> b.	Other system errors we encountered in 10 months of data testing: what are the exact events received?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
