reef-dev mailing list archives

From "Tyson Condie (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (REEF-1981) Evaluators fail to heartbeat to restarted driver
Date Thu, 01 Mar 2018 22:03:00 GMT

    [ https://issues.apache.org/jira/browse/REEF-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382725#comment-16382725 ]

Tyson Condie edited comment on REEF-1981 at 3/1/18 10:02 PM:
-------------------------------------------------------------

We are still having problems identifying the code that polls the YARN RM for the new driver
HTTP endpoint, and then subsequently calls the driver HTTP endpoint to obtain the new wake
endpoint.

We believe this code path is completely missing. Evidence: no YARN-specific code exists in
reef-runtime-yarn to handle polling the RM.
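
For discussion, the two-step lookup we expected to find could be sketched roughly as follows.
This is a hypothetical sketch, not code from REEF: the class DriverRediscovery, the method
rediscoverWakeEndpoint, and both lookup callbacks are made-up names. The RM query and the
driver HTTP call are passed in as plain functions so the retry flow is visible without
assuming any particular YARN or REEF API.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical sketch of evaluator-side driver rediscovery; not part of REEF. */
final class DriverRediscovery {

  private DriverRediscovery() {
  }

  /**
   * Step 1: ask the RM (via rmLookup) for the restarted driver's HTTP endpoint.
   * Step 2: ask that HTTP endpoint (via driverLookup) for the new wake endpoint.
   * Retries both steps up to maxAttempts times, sleeping between attempts.
   *
   * @param rmLookup     assumed callback returning the driver HTTP endpoint, if known yet
   * @param driverLookup assumed callback mapping an HTTP endpoint to a wake endpoint
   * @return the wake endpoint, or empty if it was not found within maxAttempts
   */
  static Optional<String> rediscoverWakeEndpoint(
      final Supplier<Optional<String>> rmLookup,
      final Function<String, Optional<String>> driverLookup,
      final int maxAttempts,
      final long sleepMillis) throws InterruptedException {

    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      final Optional<String> httpEndpoint = rmLookup.get();
      if (httpEndpoint.isPresent()) {
        final Optional<String> wakeEndpoint = driverLookup.apply(httpEndpoint.get());
        if (wakeEndpoint.isPresent()) {
          return wakeEndpoint;
        }
      }
      // Back off before the next attempt, but not after the final one.
      if (attempt < maxAttempts) {
        Thread.sleep(sleepMillis);
      }
    }
    return Optional.empty();
  }
}
```

If something along these lines existed, the evaluator would re-point its heartbeat channel
at the returned wake endpoint once the lookup succeeds.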


was (Author: tcondie):
We are still having problems identifying the code that polls the Yarn RM for the new driver
http endpoint, and then subsequently calls the driver http endpoint to obtain the new wake
endpoint. 

> Evaluators fail to heartbeat to restarted driver
> ------------------------------------------------
>
>                 Key: REEF-1981
>                 URL: https://issues.apache.org/jira/browse/REEF-1981
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF Evaluator
>            Reporter: Sean Po
>            Priority: Major
>
> On driver failover, we are hitting the following exception:
> {code}
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager informAboutEvaluatorFailures
> WARNING: Container [container_e4838_1519690816115_0025_01_000005] has failed during driver restart process, FailedEvaluatorHandler will be triggered, but no additional evaluator can be requested due to YARN-2433.
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager onEvaluatorException
> WARNING: Failed evaluator: container_e4838_1519690816115_0025_01_000005
> org.apache.reef.exception.EvaluatorException: Evaluator [container_e4838_1519690816115_0025_01_000005] is assumed to be in state [ALLOCATED]. But the resource manager reports it to be in state [FAILED]. This most likely means that the Evaluator suffered a failure before being used.
>  at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(EvaluatorManager.java:693)
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:91)
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:38)
>  at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(REEFEventHandlers.java:93)
>  at org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager.informAboutEvaluatorFailures(YarnDriverRuntimeRestartManager.java:230)
>  at org.apache.reef.driver.restart.DriverRestartManager.onDriverRestartCompleted(DriverRestartManager.java:282)
>  at org.apache.reef.driver.restart.DriverRestartManager.access$000(DriverRestartManager.java:47)
>  at org.apache.reef.driver.restart.DriverRestartManager$1.run(DriverRestartManager.java:136)
>  at java.util.TimerThread.mainLoop(Timer.java:555)
>  at java.util.TimerThread.run(Timer.java:505)
> {code}
> However, according to the YARN RM logs, these containers had not failed at this time. We
> suspect that the evaluators are failing to heartbeat into the new Driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
