Mailing-List: contact dev-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@flink.apache.org
MIME-Version: 1.0
References: 
 <CAK5ODX6xz=3f-dOKggv20Ljecb8-Wqcy+oFyn5sXr-Vkcw0KmA@mail.gmail.com>
 <CAC27z=P+i44WyffU=SPuSMbHvrUh3uzyKUnCeA9a4udDMXh1Aw@mail.gmail.com>
 <CAK5ODX7f6GpuJQLjXYArDBn0wLxR28AP9Rf577Qc748OS9GSnw@mail.gmail.com>
In-Reply-To: 
 <CAK5ODX7f6GpuJQLjXYArDBn0wLxR28AP9Rf577Qc748OS9GSnw@mail.gmail.com>
From: Till Rohrmann <trohrmann@apache.org>
Date: Fri, 19 Jun 2015 12:40:05 +0000
Message-ID: 
 <CAC27z=NoKFXVi470TP8=0bwPc4p33DU=FP5GnLJRzJQWV9rxnw@mail.gmail.com>
Subject: Re: Flink Runtime Exception
To: dev@flink.apache.org
Content-Type: multipart/alternative; boundary=047d7bb04272d686f60518de396e

--047d7bb04272d686f60518de396e
Content-Type: text/plain; charset=UTF-8

Yes, it was an issue for the milestone release.

On Fri, Jun 19, 2015 at 2:18 PM Andra Lungu <lungu.andra@gmail.com> wrote:

> Yes, so I am using flink-0.9.0-milestone-1. Was it a problem for this
> version?
> I'll just fetch the latest master if this is the case.
>
> On Fri, Jun 19, 2015 at 2:12 PM, Till Rohrmann <trohrmann@apache.org>
> wrote:
>
> > Hi Andra,
> >
> > the problem seems to be that the deployment of some tasks takes longer
> > than 100s. From the stack trace it looks as if you're not using the
> > latest master.
> >
> > We had problems with previous versions where the deployment call waited
> > for the TM to completely download the user code jars. For large setups
> > the BlobServer became a bottleneck and some of the deployment calls
> > timed out. We updated the deployment logic so that the TM sends an
> > immediate ACK back to the JM when it receives a new task.
> >
> > Could you verify which version of Flink you're running and, if it's not
> > the latest master, try to run your example with the latest code?
> >
> > Cheers,
> > Till
> >
> > On Fri, Jun 19, 2015 at 1:42 PM Andra Lungu <lungu.andra@gmail.com> wrote:
> >
> > > Hi everyone,
> > >
> > > I ran a job this morning on 30 wally nodes. DOP 224. Worked like a
> > > charm.
> > >
> > > Then, I ran a similar job, on the exact same configuration, on the same
> > > input data set. The only difference between them is that the second job
> > > computes the degrees per vertex and, for vertices with degree higher
> > > than a user-defined threshold, it does a bit of magic (roughly a bunch
> > > of coGroups). The problem is that, even before the extra functions get
> > > called, I get the following type of exception:
> > >
> > > 06/19/2015 12:06:43     CHAIN FlatMap (FlatMap at fromDataSet(Graph.java:171)) -> Combine(Distinct at fromDataSet(Graph.java:171))(222/224) switched to FAILED
> > > java.lang.IllegalStateException: Update task on instance 29073fb0b0957198a2b67569b042d56b @ wally004 - 8 slots - URL: akka.tcp://flink@130.149.249.14:44528/user/taskmanager failed due to:
> > >         at org.apache.flink.runtime.executiongraph.Execution$5.onFailure(Execution.java:860)
> > >         at akka.dispatch.OnFailure.internal(Future.scala:228)
> > >         at akka.dispatch.OnFailure.internal(Future.scala:227)
> > >         at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
> > >         at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
> > >         at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> > >         at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25)
> > >         at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
> > >         at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
> > >         at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> > >         at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
> > >         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > >         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> > >         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > >         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> > > Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@130.149.249.14:44528/user/taskmanager#82700874]] after [100000 ms]
> > >         at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
> > >         at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
> > >         at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
> > >         at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
> > >         at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
> > >         at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
> > >         at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
> > >         at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
> > >         at java.lang.Thread.run(Thread.java:722)
> > >
> > >
> > > At first I thought, okay, maybe wally004 is down; then I ssh'd into it.
> > > Works fine.
> > >
> > > The full output can be found here:
> > > https://gist.github.com/andralungu/d222b75cb33aea57955d
> > >
> > > Does anyone have any idea about what may have triggered this? :(
> > >
> > > Thanks!
> > > Andra
> > >
> >
>

--047d7bb04272d686f60518de396e--