reef-dev mailing list archives

From "Tyson Condie (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1981) Evaluators fail to heartbeat to restarted driver
Date Thu, 01 Mar 2018 22:01:00 GMT

    [ https://issues.apache.org/jira/browse/REEF-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382725#comment-16382725 ]

Tyson Condie commented on REEF-1981:
------------------------------------

We are still having trouble identifying the code that polls the YARN RM for the new driver
HTTP endpoint and then calls that driver HTTP endpoint to obtain the new Wake endpoint.
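
For reference, the two-step lookup described above could look roughly like the Java sketch below. It is an illustration of the flow, not the REEF code we are trying to locate. The RM path /ws/v1/cluster/apps/{appid} and its trackingUrl field come from the YARN RM REST API; the class and method names, the injected httpGet function, and in particular the /reef/v1/driver path on the driver HTTP server are assumptions made for the example. The application id in the usage note is derived from the container id in the log.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; none of these names are taken from the REEF code base. */
final class DriverEndpointLookup {

  // Matches the "trackingUrl" field of the RM's JSON application report.
  private static final Pattern TRACKING_URL =
      Pattern.compile("\"trackingUrl\"\\s*:\\s*\"([^\"]*)\"");

  private DriverEndpointLookup() { }

  private static String stripTrailingSlash(final String s) {
    return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
  }

  /** Step 1a: URL of the YARN RM REST resource describing one application. */
  static String rmAppUrl(final String rmBaseUrl, final String applicationId) {
    return stripTrailingSlash(rmBaseUrl) + "/ws/v1/cluster/apps/" + applicationId;
  }

  /** Step 1b: pull the tracking URL out of the RM response, if it is set. */
  static Optional<String> extractTrackingUrl(final String rmJson) {
    final Matcher m = TRACKING_URL.matcher(rmJson);
    if (!m.find() || m.group(1).isEmpty()) {
      return Optional.empty();
    }
    // Some JSON serializers escape '/' as '\/'; undo that.
    return Optional.of(m.group(1).replace("\\/", "/"));
  }

  /** Step 2: driver HTTP URL to query for the Wake endpoint. ASSUMED path. */
  static String driverInfoUrl(final String trackingUrl) {
    return stripTrailingSlash(trackingUrl) + "/reef/v1/driver";
  }

  /**
   * Polls the RM until the application reports a tracking URL, then returns
   * the driver URL to call next. httpGet maps a URL to a response body, so
   * the HTTP client is left to the caller.
   */
  static Optional<String> pollForDriverInfoUrl(
      final Function<String, String> httpGet, final String rmBaseUrl,
      final String applicationId, final int maxAttempts, final long delayMillis) {
    final String url = rmAppUrl(rmBaseUrl, applicationId);
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      final Optional<String> tracking = extractTrackingUrl(httpGet.apply(url));
      if (tracking.isPresent()) {
        return tracking.map(DriverEndpointLookup::driverInfoUrl);
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(delayMillis);
        } catch (final InterruptedException e) {
          Thread.currentThread().interrupt();
          return Optional.empty();
        }
      }
    }
    return Optional.empty();
  }
}
```

An evaluator-side caller would invoke something like pollForDriverInfoUrl(httpGet, rmBaseUrl, "application_1519690816115_0025", 10, 1000L), issue a GET against the returned URL, and read the new Wake endpoint from the response. If the evaluators never perform this lookup after a failover, or perform it against a stale tracking URL, that would be consistent with the missing heartbeats.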

> Evaluators fail to heartbeat to restarted driver
> ------------------------------------------------
>
>                 Key: REEF-1981
>                 URL: https://issues.apache.org/jira/browse/REEF-1981
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF Evaluator
>            Reporter: Sean Po
>            Priority: Major
>
> On driver failover, we are hitting the following exception:
> {code}
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager informAboutEvaluatorFailures
> WARNING: Container [container_e4838_1519690816115_0025_01_000005] has failed during driver restart process, FailedEvaluatorHandler will be triggered, but no additional evaluator can be requested due to YARN-2433.
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager onEvaluatorException
> WARNING: Failed evaluator: container_e4838_1519690816115_0025_01_000005
> org.apache.reef.exception.EvaluatorException: Evaluator [container_e4838_1519690816115_0025_01_000005] is assumed to be in state [ALLOCATED]. But the resource manager reports it to be in state [FAILED]. This most likely means that the Evaluator suffered a failure before being used.
>  at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(EvaluatorManager.java:693)
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:91)
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:38)
>  at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(REEFEventHandlers.java:93)
>  at org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager.informAboutEvaluatorFailures(YarnDriverRuntimeRestartManager.java:230)
>  at org.apache.reef.driver.restart.DriverRestartManager.onDriverRestartCompleted(DriverRestartManager.java:282)
>  at org.apache.reef.driver.restart.DriverRestartManager.access$000(DriverRestartManager.java:47)
>  at org.apache.reef.driver.restart.DriverRestartManager$1.run(DriverRestartManager.java:136)
>  at java.util.TimerThread.mainLoop(Timer.java:555)
>  at java.util.TimerThread.run(Timer.java:505)
> {code}
> However, according to the YARN RM logs, these containers had not failed at that time. We suspect that the evaluators are failing to heartbeat to the new driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
