reef-dev mailing list archives

From "Sanha Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1472) implement timeout for submit context and task
Date Mon, 25 Jul 2016 08:34:20 GMT

    [ https://issues.apache.org/jira/browse/REEF-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391534#comment-15391534 ]

Sanha Lee commented on REEF-1472:
---------------------------------

At present, the driver sends a {{submitTask}} / {{submitContext}} request to the evaluator
asynchronously using {{EvaluatorControlProto}}.
When {{EvaluatorRuntime}} receives this request, it creates a new Task or Context and notifies
the driver via heartbeat.
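
To make the flow above concrete, here is a minimal single-threaded sketch. It is not REEF code: {{HeartbeatSketch}}, {{Control}}, {{Heartbeat}}, and the two in-memory queues are hypothetical stand-ins for the driver-to-evaluator control channel and the heartbeat channel, and the state names are borrowed from the Active/Running events mentioned in the issue description.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative only: every name here is hypothetical, not a REEF class.
public class HeartbeatSketch {
    // Stand-in for a control message sent from driver to evaluator.
    static final class Control {
        final String kind;
        final String id;
        Control(String kind, String id) { this.kind = kind; this.id = id; }
    }

    // Stand-in for the status the evaluator reports in a heartbeat.
    static final class Heartbeat {
        final String id;
        final String state;
        Heartbeat(String id, String state) { this.id = id; this.state = state; }
    }

    final Queue<Control> toEvaluator = new ArrayDeque<>();
    final Queue<Heartbeat> toDriver = new ArrayDeque<>();

    // Driver side: enqueue the request and return at once. The call itself
    // yields no result, so the driver learns nothing from it.
    void submit(String kind, String id) {
        toEvaluator.add(new Control(kind, id));
    }

    // Evaluator side: handle one pending request, if any, and queue a heartbeat.
    void evaluatorStep() {
        Control c = toEvaluator.poll();
        if (c == null) {
            return;
        }
        String state = "submitTask".equals(c.kind) ? "RUNNING" : "ACTIVE";
        toDriver.add(new Heartbeat(c.id, state));
    }

    // Driver side: next heartbeat as "id:state", or null if none has arrived.
    String pollHeartbeat() {
        Heartbeat h = toDriver.poll();
        return h == null ? null : h.id + ":" + h.state;
    }

    public static void main(String[] args) {
        HeartbeatSketch s = new HeartbeatSketch();
        s.submit("submitTask", "task-1");
        System.out.println(s.pollHeartbeat()); // null: evaluator has not reported yet
        s.evaluatorStep();
        System.out.println(s.pollHeartbeat()); // task-1:RUNNING
    }
}
```

In this sketch the only signal the driver ever gets back is the heartbeat; if {{evaluatorStep}} never runs, {{pollHeartbeat}} simply keeps returning null.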

Therefore, if the driver never receives the expected heartbeat message after {{submitTask}} /
{{submitContext}}, we can assume that either the evaluator crashed before sending the heartbeat
or heartbeat messaging itself is not working.
In both cases, the evaluator cannot check a timeout and notify the driver.
Furthermore, the driver alone cannot check a timeout either, because it sends the request
asynchronously.

In my opinion, the current heartbeat-based notification is the best the evaluator can do for
{{submitTask}} / {{submitContext}}.
Does anyone have a different opinion? If not, I will close this issue.

> implement timeout for submit context and task 
> ----------------------------------------------
>
>                 Key: REEF-1472
>                 URL: https://issues.apache.org/jira/browse/REEF-1472
>             Project: REEF
>          Issue Type: Improvement
>          Components: REEF Driver, REEF Evaluator
>            Reporter: Andrey
>            Assignee: Sanha Lee
>            Priority: Minor
>              Labels: FT
>
> If SubmitContext / SubmitTask fails to generate the corresponding Active/Running/Failed events,
> IMRUDriver will hang until the RM times out the whole job. By design this should never happen
> unless there are bugs in the framework. So, as an improvement, we may want to create timeouts
> for these scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
