reef-dev mailing list archives

From Boris Shulman <shulm...@gmail.com>
Subject Issue in REEF 0.14
Date Thu, 19 May 2016 06:35:06 GMT
While integrating REEF 0.14, we noticed the following issue:

When an evaluator fails, the driver shuts down:



WARNING: ExceptionEvent: local: /100.77.230.68:17237 remote: /100.77.210.34:56939 :: java.io.IOException: An existing connection was forcibly closed by the remote host

May 19, 2016 6:00:28 AM
org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager
onEvaluatorException

WARNING: Failed evaluator: container_1462681171587_0057_01_000002

org.apache.reef.exception.EvaluatorException: Evaluator
[container_1462681171587_0057_01_000002] is assumed to be in state
[RUNNING]. But the resource manager reports it to be in state [FAILED].
This means that the Evaluator failed but wasn't able to send an error
message back to the driver. Task [streamingNode0] was running when the
Evaluator crashed.

        at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(EvaluatorManager.java:589)
        at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:63)
        at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:36)
        at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(REEFEventHandlers.java:91)
        at org.apache.reef.runtime.yarn.driver.YarnContainerManager.onContainerStatus(YarnContainerManager.java:391)
        at org.apache.reef.runtime.yarn.driver.YarnContainerManager.onContainersCompleted(YarnContainerManager.java:128)
        at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:300)



May 19, 2016 6:00:28 AM org.apache.reef.javabridge.generic.JobDriver
handleFailedEvaluator

SEVERE: FailedEvaluator

May 19, 2016 6:00:28 AM org.apache.reef.javabridge.generic.JobDriver
handleFailedEvaluator

INFO: removing context streamingNode0 from job driver contexts.

May 19, 2016 6:00:28 AM
org.apache.reef.runtime.common.driver.client.LoggingJobStatusHandler onNext

INFO: Received a JobStatus message that can't be sent:

identifier: "9b94916e-d860-4ca0-8ca8-4be412a70d47"

state: RUNNING

message: "Evaluator container_1462681171587_0057_01_000002 failed with
message: Evaluator [container_1462681171587_0057_01_000002] is assumed to
be in state [RUNNING]. But the resource manager reports it to be in state
[FAILED]. This means that the Evaluator failed but wasn\'t able to send an
error message back to the driver. Task [streamingNode0] was running when
the Evaluator crashed."



May 19, 2016 6:00:28 AM org.apache.reef.javabridge.generic.JobDriver
handleFailedEvaluatorInCLR

INFO: CLR FailedEvaluator handler set, handling things with CLR handler.

May 19, 2016 6:00:28 AM
org.apache.reef.runtime.common.driver.idle.DriverIdleManager
onPotentiallyIdle

INFO: All components indicated idle. Initiating Driver shutdown.





I do have a FailedEvaluator handler registered, and it submits a new evaluator request, as the CLR-side log shows:



INFO: +Java_org_apache_reef_javabridge_NativeInterop_clrSystemFailedEvaluatorHandlerOnNext

<C++> Start: 0 : 2016-05-19T06:00:28.2729926+00:00 0016

START: FailedEvaluatorClr2Java::FailedEvaluatorClr2Java

<C++> Stop: 0 : 2016-05-19T06:00:28.2729926+00:00 0016

EXIT: FailedEvaluatorClr2Java::FailedEvaluatorClr2Java

Org.Apache.REEF.Driver.Bridge.ClrSystemHandlerWrapper Start: 0 :
2016-05-19T06:00:28.2886707+00:00 0016

START: 5/19/2016 6:00:28 AM
ClrSystemHandlerWrapper::Call_ClrSystemFailedEvaluator_OnNext

<C++> Information: 0 : 2016-05-19T06:00:28.2886707+00:00 0016

INFO: FailedEvaluatorClr2Java::GetId

<C++> Start: 0 : 2016-05-19T06:00:28.8042796+00:00 0016

START: EvaluatorRequestorClr2Java::Submit
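
For readers who want the pattern spelled out, here is a minimal, self-contained sketch of a failed-evaluator handler that asks for one replacement evaluator. It is only an illustration: the handler in this report runs on the CLR side, and every type below (Requestor, FailedEvaluator, FailedEvaluatorHandler, RecordingRequestor) is a hypothetical stand-in, not the REEF API. The 512 MB size is likewise an assumed value.

```java
import java.util.ArrayList;
import java.util.List;

public class ResubmitOnFailureSketch {

    /** Hypothetical stand-in for a driver-side evaluator requestor (not the REEF interface). */
    interface Requestor {
        void submit(int count, int megabytes);
    }

    /** Hypothetical stand-in for the failure event passed to the handler. */
    static final class FailedEvaluator {
        final String id;

        FailedEvaluator(String id) {
            this.id = id;
        }
    }

    /** Reacts to a failed evaluator by requesting exactly one replacement. */
    static final class FailedEvaluatorHandler {
        private final Requestor requestor;
        private final int megabytes; // assumed size of the replacement evaluator

        FailedEvaluatorHandler(Requestor requestor, int megabytes) {
            this.requestor = requestor;
            this.megabytes = megabytes;
        }

        void onNext(FailedEvaluator failed) {
            // The failed evaluator is gone; ask for a single new one of the same size.
            requestor.submit(1, megabytes);
        }
    }

    /** Records requests in memory so the sketch can run without a cluster. */
    static final class RecordingRequestor implements Requestor {
        final List<String> submitted = new ArrayList<>();

        @Override
        public void submit(int count, int megabytes) {
            submitted.add(count + " x " + megabytes + " MB");
        }
    }

    /** Feeds one failure through the handler and returns the recorded requests. */
    public static List<String> demo() {
        RecordingRequestor requestor = new RecordingRequestor();
        FailedEvaluatorHandler handler = new FailedEvaluatorHandler(requestor, 512);
        handler.onNext(new FailedEvaluator("container_1462681171587_0057_01_000002"));
        return requestor.submitted;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The sketch only shows the intent of the handler (one failure in, one replacement request out). It does not model the driver's idle check, and it says nothing about why the driver in the logs above decides to shut down.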





Boris.
