reef-dev mailing list archives

From Andrew Chung <afchun...@gmail.com>
Subject Re: Issue in REEF 0.14
Date Thu, 19 May 2016 20:20:58 GMT
Hi Boris,

This is the issue noted in REEF-1393 [0]. See also the related email threads [1][2].

Thanks,
Andrew

[0]: https://issues.apache.org/jira/browse/REEF-1393
[1]: https://mail-archives.apache.org/mod_mbox/reef-dev/201605.mbox/%3CCAHzzxZSxaVA-xyRg2w7US=vzzvY+Qo03psXmw72=xJphJ1gdkg@mail.gmail.com%3E
[2]: https://mail-archives.apache.org/mod_mbox/reef-dev/201605.mbox/%3CCAHzzxZQuYHaxpGX--eOst0WwT8dsJN7FoOSpL=pw3F0NQUZ+6Q@mail.gmail.com%3E

On Wed, May 18, 2016 at 11:35 PM, Boris Shulman <shulmanb@gmail.com> wrote:

> While working on integrating REEF 0.14 we noticed the following issue:
>
> On evaluator failure the driver shuts down:
>
> WARNING: ExceptionEvent: local: /100.77.230.68:17237 remote: /100.77.210.34:56939 :: java.io.IOException: An existing connection was forcibly closed by the remote host
> May 19, 2016 6:00:28 AM org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager onEvaluatorException
> WARNING: Failed evaluator: container_1462681171587_0057_01_000002
> org.apache.reef.exception.EvaluatorException: Evaluator [container_1462681171587_0057_01_000002] is assumed to be in state [RUNNING]. But the resource manager reports it to be in state [FAILED]. This means that the Evaluator failed but wasn't able to send an error message back to the driver. Task [streamingNode0] was running when the Evaluator crashed.
>         at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(EvaluatorManager.java:589)
>         at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:63)
>         at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:36)
>         at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(REEFEventHandlers.java:91)
>         at org.apache.reef.runtime.yarn.driver.YarnContainerManager.onContainerStatus(YarnContainerManager.java:391)
>         at org.apache.reef.runtime.yarn.driver.YarnContainerManager.onContainersCompleted(YarnContainerManager.java:128)
>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:300)
>
> May 19, 2016 6:00:28 AM org.apache.reef.javabridge.generic.JobDriver handleFailedEvaluator
> SEVERE: FailedEvaluator
> May 19, 2016 6:00:28 AM org.apache.reef.javabridge.generic.JobDriver handleFailedEvaluator
> INFO: removing context streamingNode0 from job driver contexts.
> May 19, 2016 6:00:28 AM org.apache.reef.runtime.common.driver.client.LoggingJobStatusHandler onNext
> INFO: Received a JobStatus message that can't be sent:
> identifier: "9b94916e-d860-4ca0-8ca8-4be412a70d47"
> state: RUNNING
> message: "Evaluator container_1462681171587_0057_01_000002 failed with message: Evaluator [container_1462681171587_0057_01_000002] is assumed to be in state [RUNNING]. But the resource manager reports it to be in state [FAILED]. This means that the Evaluator failed but wasn\'t able to send an error message back to the driver. Task [streamingNode0] was running when the Evaluator crashed."
>
> May 19, 2016 6:00:28 AM org.apache.reef.javabridge.generic.JobDriver handleFailedEvaluatorInCLR
> INFO: CLR FailedEvaluator handler set, handling things with CLR handler.
> May 19, 2016 6:00:28 AM org.apache.reef.runtime.common.driver.idle.DriverIdleManager onPotentiallyIdle
> INFO: *All components indicated idle. Initiating Driver shutdown.*
>
> I do have a Failed Evaluator Handler, and I submit a new request:
>
> INFO: +Java_org_apache_reef_javabridge_NativeInterop_clrSystemFailedEvaluatorHandlerOnNext
> <C++> Start: 0 : 2016-05-19T06:00:28.2729926+00:00 0016
> START: FailedEvaluatorClr2Java::FailedEvaluatorClr2Java
> <C++> Stop: 0 : 2016-05-19T06:00:28.2729926+00:00 0016
> EXIT: FailedEvaluatorClr2Java::FailedEvaluatorClr2Java
> Org.Apache.REEF.Driver.Bridge.ClrSystemHandlerWrapper Start: 0 : 2016-05-19T06:00:28.2886707+00:00 0016
> START: 5/19/2016 6:00:28 AM ClrSystemHandlerWrapper::Call_ClrSystemFailedEvaluator_OnNext
> <C++> Information: 0 : 2016-05-19T06:00:28.2886707+00:00 0016
> INFO: FailedEvaluatorClr2Java::GetId
> <C++> Start: 0 : 2016-05-19T06:00:28.8042796+00:00 0016
> START: EvaluatorRequestorClr2Java::Submit
>
> Boris.
>
