reef-dev mailing list archives

From "Markus Weimer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1981) Evaluators fail to heartbeat to restarted driver
Date Wed, 28 Feb 2018 23:59:00 GMT

    [ https://issues.apache.org/jira/browse/REEF-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381262#comment-16381262
] 

Markus Weimer commented on REEF-1981:
-------------------------------------

{quote}it seems like AM/Driver discovery occurs via HDFS.{quote}

Kinda. IIRC, this is how it is supposed to work:

# The Evaluators detect that the Driver is down and start polling the RM for a restarted AM.
# Once they get the restarted AM's IP address, they query its HTTP interface for its Wake port.
# The Evaluators start heartbeating to the restarted Driver.
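
A rough Java sketch of steps 1 and 2 above, from memory and not taken from the REEF code base. {{ResourceManagerView}}, {{DriverHttpView}}, {{findRestartedDriver}}, the IP address and the port number are all made up for illustration; the real Evaluator code may be structured quite differently.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; none of these types are REEF or YARN APIs.
final class EvaluatorReconnectSketch {

  /** Stand-in for asking the RM whether a restarted AM exists; returns its host if so. */
  interface ResourceManagerView {
    Optional<String> restartedAmHost();
  }

  /** Stand-in for asking the AM's HTTP interface for the Driver's Wake port. */
  interface DriverHttpView {
    int wakePort(String amHost);
  }

  /** Steps 1 and 2: poll the RM until a restarted AM shows up, then look up its Wake port. */
  static String findRestartedDriver(ResourceManagerView rm, DriverHttpView http,
                                    int maxPolls, long pollIntervalMillis) {
    for (int attempt = 0; attempt < maxPolls; attempt++) {
      Optional<String> host = rm.restartedAmHost();
      if (host.isPresent()) {
        // Step 3 would start heartbeating to this address.
        return host.get() + ":" + http.wakePort(host.get());
      }
      try {
        Thread.sleep(pollIntervalMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while polling for the restarted AM", e);
      }
    }
    throw new IllegalStateException("No restarted AM found after " + maxPolls + " polls");
  }

  public static void main(String[] args) {
    // Fake RM: no AM for the first two polls, then one at a made-up address.
    Iterator<Optional<String>> answers = List.of(
        Optional.<String>empty(), Optional.<String>empty(), Optional.of("10.0.0.7")).iterator();
    String address = findRestartedDriver(answers::next, host -> 12345, 5, 10L);
    System.out.println("Heartbeat target: " + address);
  }
}
```

If the Evaluators never get past the polling loop, or get the wrong port back, they would never heartbeat to the new Driver, which would match the symptom in this issue.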

Meanwhile, the restarted Driver forms an opinion on which Evaluators it should expect to hear
from. It forms that opinion from two sources:

# YARN
# HDFS. This is / was necessary because YARN could not be trusted to communicate the right
set of Containers to expect at the time this feature was introduced into REEF.
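
To make the two-source idea concrete, here is a small Java sketch that only sorts container IDs by which source reported them. How the Driver actually reconciles the two views is not covered in this comment, so the sketch does not pick a policy. {{RestartViewSketch}}, its methods and the container IDs are hypothetical and not REEF code.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; not the REEF restart manager.
final class RestartViewSketch {

  /** Containers that both YARN and the HDFS record know about. */
  static Set<String> inBoth(Set<String> fromYarn, Set<String> fromHdfs) {
    Set<String> result = new TreeSet<>(fromYarn);
    result.retainAll(fromHdfs);
    return result;
  }

  /** Containers reported by one source but missing from the other. */
  static Set<String> onlyIn(Set<String> source, Set<String> other) {
    Set<String> result = new TreeSet<>(source);
    result.removeAll(other);
    return result;
  }

  public static void main(String[] args) {
    // Made-up container IDs, for illustration only.
    Set<String> fromYarn = Set.of("container_A", "container_B");
    Set<String> fromHdfs = Set.of("container_B", "container_C");

    System.out.println("Both sources: " + inBoth(fromYarn, fromHdfs));
    System.out.println("YARN only:    " + onlyIn(fromYarn, fromHdfs));
    System.out.println("HDFS only:    " + onlyIn(fromHdfs, fromYarn));
  }
}
```

The containers that show up in only one of the two views are the interesting ones when debugging a restart, since that is where the Driver's expectation and YARN's report can diverge.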

> Evaluators fail to heartbeat to restarted driver
> ------------------------------------------------
>
>                 Key: REEF-1981
>                 URL: https://issues.apache.org/jira/browse/REEF-1981
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF Evaluator
>            Reporter: Sean Po
>            Priority: Major
>
> On driver failover, we are hitting the following exception:
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager informAboutEvaluatorFailures
> WARNING: Container [container_e4838_1519690816115_0025_01_000005] has failed during driver restart process, FailedEvaluatorHandler will be triggered, but no additional evaluator can be requested due to YARN-2433.
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager onEvaluatorException
> WARNING: Failed evaluator: container_e4838_1519690816115_0025_01_000005
> org.apache.reef.exception.EvaluatorException: Evaluator [container_e4838_1519690816115_0025_01_000005] is assumed to be in state [ALLOCATED]. But the resource manager reports it to be in state [FAILED]. This most likely means that the Evaluator suffered a failure before being used.
>  at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(EvaluatorManager.java:693)
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:91)
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:38)
>  at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(REEFEventHandlers.java:93)
>  at org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager.informAboutEvaluatorFailures(YarnDriverRuntimeRestartManager.java:230)
>  at org.apache.reef.driver.restart.DriverRestartManager.onDriverRestartCompleted(DriverRestartManager.java:282)
>  at org.apache.reef.driver.restart.DriverRestartManager.access$000(DriverRestartManager.java:47)
>  at org.apache.reef.driver.restart.DriverRestartManager$1.run(DriverRestartManager.java:136)
>  at java.util.TimerThread.mainLoop(Timer.java:555)
>  at java.util.TimerThread.run(Timer.java:505)
>  
> However, according to the YARN RM logs, these containers had not failed at this time. We suspect that the Evaluators are failing to heartbeat to the new Driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
