reef-dev mailing list archives

From "Markus Weimer (JIRA)" <>
Subject [jira] [Commented] (REEF-1981) Evaluators fail to heartbeat to restarted driver
Date Thu, 01 Mar 2018 00:15:00 GMT


Markus Weimer commented on REEF-1981:

{quote}How does step 2 happen if the previous evaluators do not have the HTTP port that the
driver is listening on? The DriverHttpEndpoint file that gets generated by the driver is a
host:port combination.{quote}

I believe the Evaluators poll the RM to get that information.

> Evaluators fail to heartbeat to restarted driver
> ------------------------------------------------
>                 Key: REEF-1981
>                 URL:
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF Evaluator
>            Reporter: Sean Po
>            Priority: Major
> On driver failover, we are hitting the following exception:
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager
> WARNING: Container [container_e4838_1519690816115_0025_01_000005] has failed during driver
> restart process, FailedEvaluatorHandler will be triggered, but no additional evaluator can
> be requested due to YARN-2433.
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager
> WARNING: Failed evaluator: container_e4838_1519690816115_0025_01_000005
> org.apache.reef.exception.EvaluatorException: Evaluator [container_e4838_1519690816115_0025_01_000005]
> is assumed to be in state [ALLOCATED]. But the resource manager reports it to be in state
> [FAILED]. This most likely means that the Evaluator suffered a failure before being used.
>  at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(
>  at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(
>  at org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager.informAboutEvaluatorFailures(
>  at org.apache.reef.driver.restart.DriverRestartManager.onDriverRestartCompleted(
>  at org.apache.reef.driver.restart.DriverRestartManager.access$000(
>  at org.apache.reef.driver.restart.DriverRestartManager$
>  at java.util.TimerThread.mainLoop(
>  at
> However, according to the YARN RM logs, these containers had not failed at this time. We
> suspect that the evaluators are failing to heartbeat into the new Driver.

This message was sent by Atlassian JIRA
