reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Created] (REEF-1870) Kill slower Evaluators in IMRU after timeout in data loading
Date Wed, 23 Aug 2017 02:48:00 GMT
Julia created REEF-1870:
---------------------------

             Summary: Kill slower Evaluators in IMRU after timeout in data loading
                 Key: REEF-1870
                 URL: https://issues.apache.org/jira/browse/REEF-1870
             Project: REEF
          Issue Type: Improvement
            Reporter: Julia


The job was submitted with a total of 4 retries. In each retry, most of the Jobs can finish
data downloading/deserialization within 6-30 minutes. About 3 evaluators are very slow. The
slowest one took about 2-8 hours to download/deserialize the data in each retry. The retry was
triggered after a 30 min timeout (configurable). The Driver cannot send a close event to those
slower evaluators until they complete data loading and send the IRunningTask event to the
driver. After running for a long time, the Job was killed.

A simple band-aid is to kill the evaluators from which we have not received a RunningTask
by the 30 min timeout, along with cancelling the RunningTasks that have been received. It is
needless to wait 8 hours just to cancel RunningTasks that have only then finished
downloading/deserializing the data.
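
The decision rule proposed above can be sketched as follows. This is only an
illustration of the logic, not REEF code: the names plan_on_timeout and
TimeoutPlan, and the use of plain string ids for evaluators, are assumptions
made for the example. The real driver would act on its own evaluator and
RunningTask handles.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class TimeoutPlan:
    """Hypothetical summary of what the driver would do when the timeout fires."""
    tasks_to_close: FrozenSet[str]      # evaluators whose RunningTask was received
    evaluators_to_kill: FrozenSet[str]  # evaluators still loading data


def plan_on_timeout(allocated: Iterable[str],
                    running: Iterable[str],
                    elapsed_min: float,
                    timeout_min: float = 30.0) -> TimeoutPlan:
    """Split evaluators into 'close the task' and 'kill the evaluator' groups.

    allocated:   ids of all evaluators taking part in this retry
    running:     ids of evaluators that already reported a RunningTask
    elapsed_min: minutes since data loading started
    timeout_min: configurable data-loading timeout (30 min in the report)
    """
    allocated_set = frozenset(allocated)
    # Ignore ids that are not part of this retry.
    running_set = frozenset(running) & allocated_set

    # Before the timeout nothing is closed or killed.
    if elapsed_min < timeout_min:
        return TimeoutPlan(frozenset(), frozenset())

    # After the timeout: cancel the tasks we know about, and kill every
    # evaluator that has not reported a RunningTask yet instead of waiting.
    return TimeoutPlan(tasks_to_close=running_set,
                       evaluators_to_kill=allocated_set - running_set)
```

For example, with four evaluators of which two have reported a RunningTask
when the timeout is reached, the two reporting evaluators have their tasks
closed and the other two are killed right away rather than after they finish
loading.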



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
