reef-dev mailing list archives

From Boris Shulman <shulm...@gmail.com>
Subject Re: Issue in REEF 0.14
Date Thu, 19 May 2016 20:55:41 GMT
Thanks, I had already found it. Do we have an ETA for REEF-1393
<https://issues.apache.org/jira/browse/REEF-1393>? It is blocking REEF
0.15. Also, any idea why we did not see it in REEF 0.12 and earlier (we did
not try 0.13)?

On Thu, May 19, 2016 at 1:20 PM, Andrew Chung <afchung90@gmail.com> wrote:

> Hi Boris,
>
> This is the issue noted in REEF-1393 [0]. See also the email threads [1][2].
>
> Thanks,
> Andrew
>
> [0]: https://issues.apache.org/jira/browse/REEF-1393
> [1]: https://mail-archives.apache.org/mod_mbox/reef-dev/201605.mbox/%3CCAHzzxZSxaVA-xyRg2w7US=vzzvY+Qo03psXmw72=xJphJ1gdkg@mail.gmail.com%3E
> [2]: https://mail-archives.apache.org/mod_mbox/reef-dev/201605.mbox/%3CCAHzzxZQuYHaxpGX--eOst0WwT8dsJN7FoOSpL=pw3F0NQUZ+6Q@mail.gmail.com%3E
>
> On Wed, May 18, 2016 at 11:35 PM, Boris Shulman <shulmanb@gmail.com>
> wrote:
>
> > While working on integrating REEF 0.14 we noticed the following issue:
> >
> > On evaluator failure the driver shuts down:
> >
> >
> >
> > WARNING: ExceptionEvent: local: /100.77.230.68:17237 remote: /100.77.210.34:56939 :: java.io.IOException: An existing connection was forcibly closed by the remote host
> >
> > May 19, 2016 6:00:28 AM
> > org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager
> > onEvaluatorException
> >
> > WARNING: Failed evaluator: container_1462681171587_0057_01_000002
> >
> > org.apache.reef.exception.EvaluatorException: Evaluator
> > [container_1462681171587_0057_01_000002] is assumed to be in state
> > [RUNNING]. But the resource manager reports it to be in state [FAILED].
> > This means that the Evaluator failed but wasn't able to send an error
> > message back to the driver. Task [streamingNode0] was running when the
> > Evaluator crashed.
> >
> >         at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(EvaluatorManager.java:589)
> >         at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:63)
> >         at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:36)
> >         at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(REEFEventHandlers.java:91)
> >         at org.apache.reef.runtime.yarn.driver.YarnContainerManager.onContainerStatus(YarnContainerManager.java:391)
> >         at org.apache.reef.runtime.yarn.driver.YarnContainerManager.onContainersCompleted(YarnContainerManager.java:128)
> >         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:300)
> >
> >
> >
> > May 19, 2016 6:00:28 AM org.apache.reef.javabridge.generic.JobDriver
> > handleFailedEvaluator
> >
> > SEVERE: FailedEvaluator
> >
> > May 19, 2016 6:00:28 AM org.apache.reef.javabridge.generic.JobDriver
> > handleFailedEvaluator
> >
> > INFO: removing context streamingNode0 from job driver contexts.
> >
> > May 19, 2016 6:00:28 AM
> > org.apache.reef.runtime.common.driver.client.LoggingJobStatusHandler
> > onNext
> >
> > INFO: Received a JobStatus message that can't be sent:
> >
> > identifier: "9b94916e-d860-4ca0-8ca8-4be412a70d47"
> >
> > state: RUNNING
> >
> > message: "Evaluator container_1462681171587_0057_01_000002 failed with
> > message: Evaluator [container_1462681171587_0057_01_000002] is assumed to
> > be in state [RUNNING]. But the resource manager reports it to be in state
> > [FAILED]. This means that the Evaluator failed but wasn\'t able to send an
> > error message back to the driver. Task [streamingNode0] was running when
> > the Evaluator crashed."
> >
> >
> >
> > May 19, 2016 6:00:28 AM org.apache.reef.javabridge.generic.JobDriver
> > handleFailedEvaluatorInCLR
> >
> > INFO: CLR FailedEvaluator handler set, handling things with CLR handler.
> >
> > May 19, 2016 6:00:28 AM
> > org.apache.reef.runtime.common.driver.idle.DriverIdleManager
> > onPotentiallyIdle
> >
> > INFO: *All components indicated idle. Initiating Driver shutdown.*
> >
> >
> >
> >
> >
> > I do have a FailedEvaluator handler, and I submit a new request:
> >
> >
> >
> > INFO: +Java_org_apache_reef_javabridge_NativeInterop_clrSystemFailedEvaluatorHandlerOnNext
> >
> > <C++> Start: 0 : 2016-05-19T06:00:28.2729926+00:00 0016
> >
> > START: FailedEvaluatorClr2Java::FailedEvaluatorClr2Java
> >
> > <C++> Stop: 0 : 2016-05-19T06:00:28.2729926+00:00 0016
> >
> > EXIT: FailedEvaluatorClr2Java::FailedEvaluatorClr2Java
> >
> > Org.Apache.REEF.Driver.Bridge.ClrSystemHandlerWrapper Start: 0 :
> > 2016-05-19T06:00:28.2886707+00:00 0016
> >
> > START: 5/19/2016 6:00:28 AM
> > ClrSystemHandlerWrapper::Call_ClrSystemFailedEvaluator_OnNext
> >
> > <C++> Information: 0 : 2016-05-19T06:00:28.2886707+00:00 0016
> >
> > INFO: FailedEvaluatorClr2Java::GetId
> >
> > <C++> Start: 0 : 2016-05-19T06:00:28.8042796+00:00 0016
> >
> > START: EvaluatorRequestorClr2Java::Submit
> >
> >
> >
> >
> >
> > Boris.
> >
>
