reef-dev mailing list archives

From Andrew Chung <afchun...@gmail.com>
Subject Re: Issue in REEF 0.14
Date Thu, 19 May 2016 21:06:58 GMT
The CR is out and pending merge.

Thanks,
Andrew

On Thu, May 19, 2016 at 1:55 PM, Boris Shulman <shulmanb@gmail.com> wrote:

> Thanks. Found it already. Do we have an ETA for REEF-1393
> <https://issues.apache.org/jira/browse/REEF-1393>? It is blocking REEF
> 0.15. Also any idea why we did not see it in REEF 0.12 and earlier (did not
> try 0.13)?
>
> On Thu, May 19, 2016 at 1:20 PM, Andrew Chung <afchung90@gmail.com> wrote:
>
> > Hi Boris,
> >
> > This is the issue noted in REEF-1393[0]. Email threads[1][2].
> >
> > Thanks,
> > Andrew
> >
> > [0]: https://issues.apache.org/jira/browse/REEF-1393
> > [1]: https://mail-archives.apache.org/mod_mbox/reef-dev/201605.mbox/%3CCAHzzxZSxaVA-xyRg2w7US=vzzvY+Qo03psXmw72=xJphJ1gdkg@mail.gmail.com%3E
> > [2]: https://mail-archives.apache.org/mod_mbox/reef-dev/201605.mbox/%3CCAHzzxZQuYHaxpGX--eOst0WwT8dsJN7FoOSpL=pw3F0NQUZ+6Q@mail.gmail.com%3E
> >
> > On Wed, May 18, 2016 at 11:35 PM, Boris Shulman <shulmanb@gmail.com>
> > wrote:
> >
> > > While working on integrating REEF 0.14 we noticed the following issue:
> > >
> > > On evaluator failure the driver shuts down:
> > >
> > >
> > >
> > > WARNING: ExceptionEvent: local: /100.77.230.68:17237 remote: /100.77.210.34:56939 :: java.io.IOException: An existing connection was forcibly closed by the remote host
> > >
> > > May 19, 2016 6:00:28 AM
> > > org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager
> > > onEvaluatorException
> > >
> > > WARNING: Failed evaluator: container_1462681171587_0057_01_000002
> > >
> > > org.apache.reef.exception.EvaluatorException: Evaluator
> > > [container_1462681171587_0057_01_000002] is assumed to be in state
> > > [RUNNING]. But the resource manager reports it to be in state [FAILED].
> > > This means that the Evaluator failed but wasn't able to send an error
> > > message back to the driver. Task [streamingNode0] was running when the
> > > Evaluator crashed.
> > >
> > >         at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(EvaluatorManager.java:589)
> > >         at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:63)
> > >         at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:36)
> > >         at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(REEFEventHandlers.java:91)
> > >         at org.apache.reef.runtime.yarn.driver.YarnContainerManager.onContainerStatus(YarnContainerManager.java:391)
> > >         at org.apache.reef.runtime.yarn.driver.YarnContainerManager.onContainersCompleted(YarnContainerManager.java:128)
> > >         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:300)
> > >
> > >
> > >
> > > May 19, 2016 6:00:28 AM org.apache.reef.javabridge.generic.JobDriver
> > > handleFailedEvaluator
> > >
> > > SEVERE: FailedEvaluator
> > >
> > > May 19, 2016 6:00:28 AM org.apache.reef.javabridge.generic.JobDriver
> > > handleFailedEvaluator
> > >
> > > INFO: removing context streamingNode0 from job driver contexts.
> > >
> > > May 19, 2016 6:00:28 AM
> > > org.apache.reef.runtime.common.driver.client.LoggingJobStatusHandler
> > > onNext
> > >
> > > INFO: Received a JobStatus message that can't be sent:
> > >
> > > identifier: "9b94916e-d860-4ca0-8ca8-4be412a70d47"
> > >
> > > state: RUNNING
> > >
> > > message: "Evaluator container_1462681171587_0057_01_000002 failed with
> > > message: Evaluator [container_1462681171587_0057_01_000002] is assumed to
> > > be in state [RUNNING]. But the resource manager reports it to be in state
> > > [FAILED]. This means that the Evaluator failed but wasn\'t able to send an
> > > error message back to the driver. Task [streamingNode0] was running when
> > > the Evaluator crashed."
> > >
> > >
> > >
> > > May 19, 2016 6:00:28 AM org.apache.reef.javabridge.generic.JobDriver
> > > handleFailedEvaluatorInCLR
> > >
> > > INFO: CLR FailedEvaluator handler set, handling things with CLR handler.
> > >
> > > May 19, 2016 6:00:28 AM
> > > org.apache.reef.runtime.common.driver.idle.DriverIdleManager
> > > onPotentiallyIdle
> > >
> > > INFO: *All components indicated idle. Initiating Driver shutdown.*
> > >
> > >
> > >
> > >
> > >
> > > I do have a Failed Evaluator handler, and I submit a new request:
> > >
> > >
> > >
> > > INFO: +Java_org_apache_reef_javabridge_NativeInterop_clrSystemFailedEvaluatorHandlerOnNext
> > >
> > > <C++> Start: 0 : 2016-05-19T06:00:28.2729926+00:00 0016
> > >
> > > START: FailedEvaluatorClr2Java::FailedEvaluatorClr2Java
> > >
> > > <C++> Stop: 0 : 2016-05-19T06:00:28.2729926+00:00 0016
> > >
> > > EXIT: FailedEvaluatorClr2Java::FailedEvaluatorClr2Java
> > >
> > > Org.Apache.REEF.Driver.Bridge.ClrSystemHandlerWrapper Start: 0 :
> > > 2016-05-19T06:00:28.2886707+00:00 0016
> > >
> > > START: 5/19/2016 6:00:28 AM
> > > ClrSystemHandlerWrapper::Call_ClrSystemFailedEvaluator_OnNext
> > >
> > > <C++> Information: 0 : 2016-05-19T06:00:28.2886707+00:00 0016
> > >
> > > INFO: FailedEvaluatorClr2Java::GetId
> > >
> > > <C++> Start: 0 : 2016-05-19T06:00:28.8042796+00:00 0016
> > >
> > > START: EvaluatorRequestorClr2Java::Submit
> > >
> > >
> > >
> > >
> > >
> > > Boris.
> > >
> >
>
