reef-dev mailing list archives

From "Tyson Condie (JIRA)" <>
Subject [jira] [Commented] (REEF-1981) Evaluators fail to heartbeat to restarted driver
Date Thu, 01 Mar 2018 20:28:00 GMT


Tyson Condie commented on REEF-1981:

[~markus.weimer] Regarding HTTP rediscovery, I was able to identify the driver-side code that
re-registers with YARN using the current HTTP endpoint. Assuming that is correct, I was not
able to figure out how the HttpHandler that is supposed to respond to such requests gets established.

Specifically, I am assuming that HttpServerReefEventHandler is responsible for responding to
evaluator calls that query for the wake endpoint. However, I do not see any code (e.g., bindings)
that would establish an HttpServerReefEventHandler instance, which would explain the behavior
that we're seeing: on restart, all evaluators are reported as failed.

> Evaluators fail to heartbeat to restarted driver
> ------------------------------------------------
>                 Key: REEF-1981
>                 URL:
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF Evaluator
>            Reporter: Sean Po
>            Priority: Major
> On driver failover, we are hitting the following exception:
> {code}
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager
> WARNING: Container [container_e4838_1519690816115_0025_01_000005] has failed during driver restart process, FailedEvaluatorHandler will be triggered, but no additional evaluator can be requested due to YARN-2433.
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager
> WARNING: Failed evaluator: container_e4838_1519690816115_0025_01_000005
> org.apache.reef.exception.EvaluatorException: Evaluator [container_e4838_1519690816115_0025_01_000005] is assumed to be in state [ALLOCATED]. But the resource manager reports it to be in state [FAILED]. This most likely means that the Evaluator suffered a failure before being used.
>  at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(
>  at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(
>  at org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager.informAboutEvaluatorFailures(
>  at org.apache.reef.driver.restart.DriverRestartManager.onDriverRestartCompleted(
>  at org.apache.reef.driver.restart.DriverRestartManager.access$000(
>  at org.apache.reef.driver.restart.DriverRestartManager$
>  at java.util.TimerThread.mainLoop(
>  at
> {code}
> However, according to the YARN RM logs, these containers have not failed at this time. We
> suspect that the evaluators are failing to heartbeat into the new Driver.

This message was sent by Atlassian JIRA