reef-dev mailing list archives

From "Markus Weimer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1870) Kill slower Evaluators in IMRU after timeout in data loading
Date Wed, 23 Aug 2017 22:27:00 GMT

    [ https://issues.apache.org/jira/browse/REEF-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139244#comment-16139244 ]

Markus Weimer commented on REEF-1870:
-------------------------------------

Who is doing the Evaluator killing? The Driver or the Evaluator?

> Kill slower Evaluators in IMRU after timeout in data loading
> ------------------------------------------------------------
>
>                 Key: REEF-1870
>                 URL: https://issues.apache.org/jira/browse/REEF-1870
>             Project: REEF
>          Issue Type: Improvement
>          Components: IMRU, REEF
>    Affects Versions: 0.17
>            Reporter: Julia
>            Assignee: Julia
>              Labels: FT
>
> The job was submitted with a total of 4 retries. In each retry, most of the Jobs can
> finish data downloading/deserialization within 6-30 minutes. About 3 evaluators are very
> slow. The slowest one took about 2-8 hours to download/deserialize its data in each retry.
> The retry was triggered after a 30-minute timeout (configurable). The Driver cannot send a
> close event to those slower evaluators until they complete data loading and send the
> IRunningTask event to the Driver. After running for a long time, the Job was killed.
> A simple band-aid is to kill the evaluators from which we have not received a RunningTask
> after the 30-minute timeout, along with cancelling the RunningTasks that have been received.
> It is needless to wait 8 hours just to cancel RunningTasks that have only then finished
> downloading/deserializing the data.
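
The band-aid described above can be sketched roughly as follows. This is only
an illustration of the decision rule, not REEF or IMRU code: the names
TimeoutPlan and plan_after_timeout and the string IDs are hypothetical, and it
assumes the decision is made in one place (for example the Driver) that knows
which evaluators were allocated and which have already reported a RunningTask.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class TimeoutPlan:
    """Hypothetical result type: what to do once the timeout has fired."""
    evaluators_to_kill: List[str]
    tasks_to_cancel: List[str]


def plan_after_timeout(
    allocated_evaluators: Set[str],
    running_task_by_evaluator: Dict[str, str],
    elapsed_minutes: float,
    timeout_minutes: float = 30.0,  # the issue says this value is configurable
) -> TimeoutPlan:
    """Decide which evaluators to kill and which running tasks to cancel.

    Before the timeout nothing happens. After it, every evaluator that has
    not yet reported a running task is killed outright, and every task that
    has been reported is cancelled.
    """
    if elapsed_minutes < timeout_minutes:
        return TimeoutPlan([], [])

    # Evaluators still loading data: no RunningTask was received from them.
    to_kill = sorted(
        e for e in allocated_evaluators if e not in running_task_by_evaluator
    )
    # Evaluators that finished loading: cancel the task they reported.
    to_cancel = sorted(
        running_task_by_evaluator[e]
        for e in allocated_evaluators
        if e in running_task_by_evaluator
    )
    return TimeoutPlan(to_kill, to_cancel)


plan = plan_after_timeout(
    allocated_evaluators={"eval-1", "eval-2", "eval-3"},
    running_task_by_evaluator={"eval-1": "task-1", "eval-2": "task-2"},
    elapsed_minutes=31,
)
print(plan.evaluators_to_kill)  # ['eval-3']
print(plan.tasks_to_cancel)     # ['task-1', 'task-2']
```

The point of the rule is that the slow evaluator is never waited on: it is
killed as soon as the timeout fires, so the retry does not depend on how long
its data loading would eventually take.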



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
