reef-dev mailing list archives

From "Sean Po (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (REEF-1981) Evaluators fail to heartbeat to restarted driver
Date Thu, 01 Mar 2018 01:53:00 GMT

    [ https://issues.apache.org/jira/browse/REEF-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381346#comment-16381346 ]

Sean Po edited comment on REEF-1981 at 3/1/18 1:52 AM:
-------------------------------------------------------

Thanks [~bgchun] for adding to the thread. For more information, I also noticed this file:
{{lang\java\reef-common\src\main\java\org\apache\reef\runtime\common\evaluator\DriverConnection.java}}
which seems to be responsible for re-identifying the driver endpoint, but that part isn't implemented.
This is tracked by REEF-843.

I might be reading this incorrectly, but it seems that {{DriverRemoteIdentifier}} is
the named parameter that's supposed to carry the driver endpoint. If it's injected, that
suggests the endpoint isn't periodically re-checked after injection. I'm seeing this in
{{lang\java\reef-common\src\main\java\org\apache\reef\runtime\common\evaluator\EvaluatorRuntime.java}}
and {{lang\java\reef-common\src\main\java\org\apache\reef\runtime\common\evaluator\HeartBeatManager.java}}.
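
To illustrate the concern, here is a minimal sketch of the difference between an endpoint that is captured once and one that is looked up again on each heartbeat. This is not REEF code: the class names, the {{AtomicReference}} standing in for wherever the current driver address lives, and the example addresses are all hypothetical, and the real HeartBeatManager may behave differently.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class DriverEndpointSketch {

    /** Hypothetical: keeps the endpoint it was given at construction, like a value injected once. */
    static final class FixedEndpointHeartbeat {
        private final String driverEndpoint;

        FixedEndpointHeartbeat(String driverEndpoint) {
            this.driverEndpoint = driverEndpoint;
        }

        /** Returns the address the next heartbeat would be sent to. */
        String target() {
            return driverEndpoint;
        }
    }

    /** Hypothetical: asks a lookup for the current endpoint before every heartbeat. */
    static final class RefreshingHeartbeat {
        private final Supplier<String> endpointLookup;

        RefreshingHeartbeat(Supplier<String> endpointLookup) {
            this.endpointLookup = endpointLookup;
        }

        /** Returns the address the next heartbeat would be sent to. */
        String target() {
            return endpointLookup.get();
        }
    }

    public static void main(String[] args) {
        // Stand-in for the place the current driver address is published (an assumption).
        AtomicReference<String> currentDriver = new AtomicReference<>("socket://10.0.0.1:9000");

        FixedEndpointHeartbeat fixed = new FixedEndpointHeartbeat(currentDriver.get());
        RefreshingHeartbeat refreshing = new RefreshingHeartbeat(currentDriver::get);

        // Simulate a driver restart on a different host.
        currentDriver.set("socket://10.0.0.2:9000");

        System.out.println("fixed      -> " + fixed.target());      // still the old address
        System.out.println("refreshing -> " + refreshing.target()); // the new address
    }
}
```

If the evaluator side behaves like the first class, heartbeats would keep going to the old driver address after a restart, which would be consistent with the symptoms in this issue.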


was (Author: seanpo03):
Thanks [~bgchun] for adding to the thread. For more information, I also noticed this file:
{{lang\java\reef-common\src\main\java\org\apache\reef\runtime\common\evaluator\DriverConnection.java}}
that seems to be responsible for re-identifying the Driver Endpoint that isn't implemented.

I might be reading this incorrectly, but it seems that the DriverRemoteIdentifier.class is
the named parameter that's supposed to include the driver endpoint. If it's injected, then
it seems to suggest that this information isn't periodically checked. I'm seeing this from
{{lang\java\reef-common\src\main\java\org\apache\reef\runtime\common\evaluator\EvaluatorRuntime.java}}
and {{lang\java\reef-common\src\main\java\org\apache\reef\runtime\common\evaluator\HeartBeatManager.java}}.

> Evaluators fail to heartbeat to restarted driver
> ------------------------------------------------
>
>                 Key: REEF-1981
>                 URL: https://issues.apache.org/jira/browse/REEF-1981
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF Evaluator
>            Reporter: Sean Po
>            Priority: Major
>
> On driver failover, we are hitting the following exception:
> {code}
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager informAboutEvaluatorFailures
> WARNING: Container [container_e4838_1519690816115_0025_01_000005] has failed during driver restart process, FailedEvaluatorHandler will be triggered, but no additional evaluator can be requested due to YARN-2433.
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager onEvaluatorException
> WARNING: Failed evaluator: container_e4838_1519690816115_0025_01_000005
> org.apache.reef.exception.EvaluatorException: Evaluator [container_e4838_1519690816115_0025_01_000005] is assumed to be in state [ALLOCATED]. But the resource manager reports it to be in state [FAILED]. This most likely means that the Evaluator suffered a failure before being used.
>  at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(EvaluatorManager.java:693)
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:91)
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:38)
>  at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(REEFEventHandlers.java:93)
>  at org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager.informAboutEvaluatorFailures(YarnDriverRuntimeRestartManager.java:230)
>  at org.apache.reef.driver.restart.DriverRestartManager.onDriverRestartCompleted(DriverRestartManager.java:282)
>  at org.apache.reef.driver.restart.DriverRestartManager.access$000(DriverRestartManager.java:47)
>  at org.apache.reef.driver.restart.DriverRestartManager$1.run(DriverRestartManager.java:136)
>  at java.util.TimerThread.mainLoop(Timer.java:555)
>  at java.util.TimerThread.run(Timer.java:505)
> {code}
> However, according to the YARN RM logs, these containers had not failed at this time. We
> suspect that the evaluators are failing to heartbeat to the new Driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
