reef-dev mailing list archives

From "Andrew Chung (JIRA)" <>
Subject [jira] [Commented] (REEF-1981) Evaluators fail to heartbeat to restarted driver
Date Sat, 03 Mar 2018 15:56:00 GMT


Andrew Chung commented on REEF-1981:

I believe the codepath is only available in the C# Evaluator.
[~seanpo03] is right, ``lang\java\reef-common\src\main\java\org\apache\reef\runtime\common\evaluator\`` should be the file in Java in which the evaluators query for the Driver connection.
The corresponding implementation in the C# Evaluator is ``reef/lang/cs/Org.Apache.REEF.Common/Evaluator/DriverInformation.cs``, which is also where the Driver endpoint is queried.
The Java side should implement the same or similar logic for Driver failover to work.
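
To make the suggestion concrete, here is a minimal sketch of what such a lookup could look like on the Java side. It is not taken from REEF: the class name ``DriverEndpointResolver``, the ``lookup`` supplier (standing in for whatever query returns the current Driver endpoint), and the retry parameters are all hypothetical. The idea is only that an Evaluator whose heartbeat fails re-queries the Driver endpoint a bounded number of times and switches over if the answer has changed.

```java
import java.util.Optional;
import java.util.function.Supplier;

/**
 * Hypothetical sketch, not REEF code: re-resolve the Driver endpoint after
 * heartbeats to the previously known endpoint start failing.
 */
final class DriverEndpointResolver {
  // Assumption: some query that returns the Driver's current endpoint, if known.
  private final Supplier<Optional<String>> lookup;
  private final int maxAttempts;
  private final long backoffMillis;
  private String current;

  DriverEndpointResolver(final String initialEndpoint,
                         final Supplier<Optional<String>> lookup,
                         final int maxAttempts,
                         final long backoffMillis) {
    this.current = initialEndpoint;
    this.lookup = lookup;
    this.maxAttempts = maxAttempts;
    this.backoffMillis = backoffMillis;
  }

  /** The endpoint heartbeats should currently be sent to. */
  String current() {
    return current;
  }

  /**
   * Intended to be called when a heartbeat fails. Queries for the Driver
   * endpoint up to maxAttempts times and returns true as soon as an endpoint
   * different from the current one is reported; returns false otherwise.
   */
  boolean refresh() {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      final Optional<String> found = lookup.get();
      if (found.isPresent() && !found.get().equals(current)) {
        current = found.get();
        return true;
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(backoffMillis);
        } catch (final InterruptedException e) {
          // Preserve the interrupt and give up on this refresh.
          Thread.currentThread().interrupt();
          return false;
        }
      }
    }
    return false;
  }
}
```

An Evaluator-side heartbeat loop would call ``refresh()`` after a failed send and, if it returns true, reconnect to ``current()``. How the real lookup is performed and how often it should be retried would need to follow what the C# Evaluator does.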

> Evaluators fail to heartbeat to restarted driver
> ------------------------------------------------
>                 Key: REEF-1981
>                 URL:
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF Evaluator
>            Reporter: Sean Po
>            Priority: Major
> On driver failover, we are hitting the following exception:
> {code}
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager
> WARNING: Container [container_e4838_1519690816115_0025_01_000005] has failed during driver restart process, FailedEvaluatorHandler will be triggered, but no additional evaluator can be requested due to YARN-2433.
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager
> WARNING: Failed evaluator: container_e4838_1519690816115_0025_01_000005
> org.apache.reef.exception.EvaluatorException: Evaluator [container_e4838_1519690816115_0025_01_000005] is assumed to be in state [ALLOCATED]. But the resource manager reports it to be in state [FAILED]. This most likely means that the Evaluator suffered a failure before being used.
>  at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(
>  at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(
>  at org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager.informAboutEvaluatorFailures(
>  at org.apache.reef.driver.restart.DriverRestartManager.onDriverRestartCompleted(
>  at org.apache.reef.driver.restart.DriverRestartManager.access$000(
>  at org.apache.reef.driver.restart.DriverRestartManager$
>  at java.util.TimerThread.mainLoop(
>  at
> {code}
> However, according to Yarn RM logs, these containers have not failed at this time. We suspect that the evaluators are failing to heartbeat into the new Driver.

This message was sent by Atlassian JIRA
