reef-dev mailing list archives

From "Andrew Chung (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (REEF-1981) Evaluators fail to heartbeat to restarted driver
Date Sat, 03 Mar 2018 15:57:00 GMT

    [ https://issues.apache.org/jira/browse/REEF-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16384710#comment-16384710 ]

Andrew Chung edited comment on REEF-1981 at 3/3/18 3:56 PM:
------------------------------------------------------------

I believe the codepath is only available in the C# Evaluator.
[~seanpo03] is right, {{lang\java\reef-common\src\main\java\org\apache\reef\runtime\common\evaluator\DriverConnection.java}} should be the Java file in which the evaluators query for the Driver connection.
The corresponding implementation in the C# Evaluator is {{reef/lang/cs/Org.Apache.REEF.Common/Evaluator/DriverInformation.cs}}, with the querying of the Driver endpoint happening here: {{reef/lang/cs/Org.Apache.REEF.Common/Evaluator/DriverInformation.cs}}.
The Java side should implement the same or similar logic for Driver failover to work.
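
To make the suggestion concrete, here is a rough sketch of what "re-query the Driver endpoint after a restart" could look like on the Java side. This is not REEF code: the class name, the {{lookup}} supplier (a stand-in for whatever source the evaluator would query), and the retry bound are all assumptions made for illustration.

```java
import java.util.Optional;
import java.util.function.Supplier;

/**
 * Hypothetical sketch, not REEF code: re-resolves the Driver endpoint
 * after a Driver restart by polling a lookup source.
 */
final class DriverEndpointResolverSketch {
  // Stand-in (assumption) for whatever the evaluator queries to find the Driver.
  private final Supplier<Optional<String>> lookup;
  private String currentEndpoint;

  DriverEndpointResolverSketch(final String initialEndpoint,
                               final Supplier<Optional<String>> lookup) {
    this.currentEndpoint = initialEndpoint;
    this.lookup = lookup;
  }

  String getCurrentEndpoint() {
    return currentEndpoint;
  }

  /**
   * Polls the lookup up to maxAttempts times. Switches to the reported
   * endpoint and returns true as soon as it differs from the current one;
   * returns false if no new endpoint was seen.
   */
  boolean refresh(final int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      final Optional<String> candidate = lookup.get();
      if (candidate.isPresent() && !candidate.get().equals(currentEndpoint)) {
        currentEndpoint = candidate.get();
        return true;
      }
    }
    return false;
  }
}
```

An evaluator would call {{refresh}} when a heartbeat to the old endpoint fails, then resume heartbeating against {{getCurrentEndpoint()}}. A real implementation would also need a delay between attempts and a decision about what to do when no new endpoint ever appears.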


was (Author: afchung90):
I believe the codepath is only available in the C# Evaluator.
 [~seanpo03] is right, ``lang\java\reef-common\src\main\java\org\apache\reef\runtime\common\evaluator\DriverConnection.java`` should be the file in Java in which the evaluators query for the Driver connection.
 The corresponding implemented version in the C# Evaluator is `reef/lang/cs/Org.Apache.REEF.Common/Evaluator/DriverInformation.cs`,
with the querying of the Driver endpoint happening here: ``reef/lang/cs/Org.Apache.REEF.Common/Evaluator/DriverInformation.cs``.
 The Java side should implement the same or similar logic for Driver failover to work.

> Evaluators fail to heartbeat to restarted driver
> ------------------------------------------------
>
>                 Key: REEF-1981
>                 URL: https://issues.apache.org/jira/browse/REEF-1981
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF Evaluator
>            Reporter: Sean Po
>            Priority: Major
>
> On driver failover, we are hitting the following exception:
> {code}
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager informAboutEvaluatorFailures
> WARNING: Container [container_e4838_1519690816115_0025_01_000005] has failed during driver restart process, FailedEvaluatorHandler will be triggered, but no additional evaluator can be requested due to YARN-2433.
> Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager onEvaluatorException
> WARNING: Failed evaluator: container_e4838_1519690816115_0025_01_000005
> org.apache.reef.exception.EvaluatorException: Evaluator [container_e4838_1519690816115_0025_01_000005] is assumed to be in state [ALLOCATED]. But the resource manager reports it to be in state [FAILED]. This most likely means that the Evaluator suffered a failure before being used.
>  at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(EvaluatorManager.java:693)
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:91)
>  at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:38)
>  at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(REEFEventHandlers.java:93)
>  at org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager.informAboutEvaluatorFailures(YarnDriverRuntimeRestartManager.java:230)
>  at org.apache.reef.driver.restart.DriverRestartManager.onDriverRestartCompleted(DriverRestartManager.java:282)
>  at org.apache.reef.driver.restart.DriverRestartManager.access$000(DriverRestartManager.java:47)
>  at org.apache.reef.driver.restart.DriverRestartManager$1.run(DriverRestartManager.java:136)
>  at java.util.TimerThread.mainLoop(Timer.java:555)
>  at java.util.TimerThread.run(Timer.java:505)
> {code}
> However, according to the YARN RM logs, these containers have not failed at this time. We suspect that the evaluators are failing to heartbeat into the new Driver.
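
One way to read the trace above: when the restart window closes, any evaluator that has not yet heartbeated to the new Driver is reported as failed, even if its container is still alive. The sketch below illustrates that bookkeeping only. It is not REEF's DriverRestartManager; the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

/**
 * Hypothetical sketch, not REEF code: tracks which evaluators have
 * heartbeated to a restarted Driver before the restart window closes.
 */
final class RestartHeartbeatTrackerSketch {
  // Evaluators known from before the restart that have not reported in yet.
  private final Set<String> awaiting = new TreeSet<>();

  RestartHeartbeatTrackerSketch(final Set<String> expectedEvaluatorIds) {
    awaiting.addAll(expectedEvaluatorIds);
  }

  /** Called when an evaluator's heartbeat reaches the restarted Driver. */
  void onHeartbeat(final String evaluatorId) {
    awaiting.remove(evaluatorId);
  }

  /** Called when the restart window closes: returns everything still awaited. */
  Set<String> onRestartCompleted() {
    return new TreeSet<>(awaiting);
  }
}
```

Under this model, an evaluator that keeps heartbeating to the old Driver address never reaches {{onHeartbeat}} and ends up in the set returned by {{onRestartCompleted}}, which would match a Driver-side failure report for a container that YARN still considers running.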



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
