flink-dev mailing list archives

From Fabian Hueske <fhue...@gmail.com>
Subject Re: Flink Runtime Exception
Date Fri, 19 Jun 2015 13:27:17 GMT
woops, sorry!
Whenever I read the word "deadlock" I get a bit nervous and distracted ;-)

2015-06-19 15:21 GMT+02:00 Till Rohrmann <trohrmann@apache.org>:

> I think Andra wrote that there is *no deadlock*.
>
> On Fri, Jun 19, 2015 at 3:18 PM Fabian Hueske <fhueske@gmail.com> wrote:
>
> Hi Andra,
> >
> > The system should never deadlock.
> > There is a bug somewhere if that happens.
> >
> > Can you check if the program is really stuck?
> >
> > Cheers, Fabian
> >
> > 2015-06-19 15:08 GMT+02:00 Till Rohrmann <trohrmann@apache.org>:
> >
> > > What does forever mean? Usually it's the case that you see a steep
> > > decline in performance once the system starts spilling data to disk
> > > because of the disk I/O bottleneck.
> > >
> > > The system always starts spilling to disk when it has no more memory
> > > left for its operations. So, for example, if you want to sort data
> > > which cannot be kept completely in memory, then the system has to
> > > employ external sorting. If you can give Flink more memory, then you
> > > can avoid this problem, depending on the actual data size.
> > >
> > > You can observe disk I/O by using the `iotop` command, for example. If
> > > you see Flink having high I/O usage, then this might be an indicator
> > > of spilling.
> > >
> > > Cheers,
> > > Till
> > >
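[Editor's note: the memory Till refers to is configured in flink-conf.yaml. A minimal sketch, assuming the 0.9-era configuration keys; the values are placeholders that need tuning per cluster:

```yaml
# Heap size per TaskManager JVM, in MB (placeholder value -- size to your nodes)
taskmanager.heap.mb: 4096

# Fraction of that heap handed to Flink's managed memory (sorting, hashing);
# raising available memory delays spilling for a given data size
taskmanager.memory.fraction: 0.7

# Network buffers; relevant for high DOPs, but they do not prevent spilling
taskmanager.network.numberOfBuffers: 4096
```

TaskManagers read this file at startup, so a restart is needed for changes to take effect.]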
> > > On Fri, Jun 19, 2015 at 2:54 PM Andra Lungu <lungu.andra@gmail.com> wrote:
> > >
> > > > Another problem that I encountered during the same set of
> > > > experiments (sorry if I am asking too many questions, I am eager to
> > > > get things fixed):
> > > > - for the same configuration, a piece of code runs perfectly on 10GB
> > > > of input, then for 38GB it runs forever (no deadlock).
> > > >
> > > > I believe that may occur because Flink spills information to disk
> > > > every time it runs out of memory... Is this fixable by increasing
> > > > the number of buffers?
> > > >
> > > > That's the last question for today, promise :)
> > > >
> > > > Thanks!
> > > >
> > > > On Fri, Jun 19, 2015 at 2:40 PM, Till Rohrmann <trohrmann@apache.org> wrote:
> > > >
> > > > > Yes, it was an issue for the milestone release.
> > > > >
> > > > > On Fri, Jun 19, 2015 at 2:18 PM Andra Lungu <lungu.andra@gmail.com> wrote:
> > > > >
> > > > > > Yes, so I am using flink-0.9.0-milestone-1. Was it a problem for
> > > > > > this version?
> > > > > > I'll just fetch the latest master if this is the case.
> > > > > >
> > > > > > On Fri, Jun 19, 2015 at 2:12 PM, Till Rohrmann <trohrmann@apache.org> wrote:
> > > > > >
> > > > > > > Hi Andra,
> > > > > > >
> > > > > > > the problem seems to be that the deployment of some tasks
> > > > > > > takes longer than 100s. From the stack trace it looks as if
> > > > > > > you're not using the latest master.
> > > > > > >
> > > > > > > We had problems with a previous version where the deployment
> > > > > > > call waited for the TM to completely download the user code
> > > > > > > jars. For large setups the BlobServer became a bottleneck and
> > > > > > > some of the deployment calls timed out. We updated the
> > > > > > > deployment logic so that the TM sends an immediate ACK back to
> > > > > > > the JM when it receives a new task.
> > > > > > >
> > > > > > > Could you verify which version of Flink you're running and, in
> > > > > > > case it's not the latest master, could you please try to run
> > > > > > > your example with the latest code?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
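[Editor's note: as a stopgap on an affected version, the 100 s ask timeout that trips here can also be raised in flink-conf.yaml. A sketch, assuming the 0.9-era configuration key; the value is a placeholder:

```yaml
# Timeout for actor "ask" calls, including task deployment (default: 100 s)
akka.ask.timeout: 300 s
```

This only masks the slow deployment; the fix described above, in the updated deployment logic, addresses the cause.]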
> > > > > > > On Fri, Jun 19, 2015 at 1:42 PM Andra Lungu <lungu.andra@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > I ran a job this morning on 30 wally nodes. DOP 224. Worked
> > > > > > > > like a charm.
> > > > > > > >
> > > > > > > > Then, I ran a similar job, on the exact same configuration,
> > > > > > > > on the same input data set. The only difference between them
> > > > > > > > is that the second job computes the degrees per vertex and,
> > > > > > > > for vertices with degree higher than a user-defined
> > > > > > > > threshold, it does a bit of magic (roughly a bunch of
> > > > > > > > coGroups). The problem is that, even before the extra
> > > > > > > > functions get called, I get the following type of exception:
> > > > > > > >
> > > > > > > > 06/19/2015 12:06:43     CHAIN FlatMap (FlatMap at
> > > > > > > > fromDataSet(Graph.java:171)) -> Combine(Distinct at
> > > > > > > > fromDataSet(Graph.java:171))(222/224) switched to FAILED
> > > > > > > > java.lang.IllegalStateException: Update task on instance
> > > > > > > > 29073fb0b0957198a2b67569b042d56b @ wally004 - 8 slots - URL:
> > > > > > > > akka.tcp://flink@130.149.249.14:44528/user/taskmanager failed due to:
> > > > > > > >         at org.apache.flink.runtime.executiongraph.Execution$5.onFailure(Execution.java:860)
> > > > > > > >         at akka.dispatch.OnFailure.internal(Future.scala:228)
> > > > > > > >         at akka.dispatch.OnFailure.internal(Future.scala:227)
> > > > > > > >         at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
> > > > > > > >         at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
> > > > > > > >         at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> > > > > > > >         at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25)
> > > > > > > >         at scala.concurrent.Future$anonfun$onFailure$1.apply(Future.scala:136)
> > > > > > > >         at scala.concurrent.Future$anonfun$onFailure$1.apply(Future.scala:134)
> > > > > > > >         at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> > > > > > > >         at scala.concurrent.impl.ExecutionContextImpl$anon$3.exec(ExecutionContextImpl.scala:107)
> > > > > > > >         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > > > > > > >         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> > > > > > > >         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > > > > > > >         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> > > > > > > > Caused by: akka.pattern.AskTimeoutException: Ask timed out on
> > > > > > > > [Actor[akka.tcp://flink@130.149.249.14:44528/user/taskmanager#82700874]]
> > > > > > > > after [100000 ms]
> > > > > > > >         at akka.pattern.PromiseActorRef$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
> > > > > > > >         at akka.actor.Scheduler$anon$7.run(Scheduler.scala:117)
> > > > > > > >         at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$unbatchedExecute(Future.scala:694)
> > > > > > > >         at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
> > > > > > > >         at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
> > > > > > > >         at akka.actor.LightArrayRevolverScheduler$anon$8.executeBucket$1(Scheduler.scala:419)
> > > > > > > >         at akka.actor.LightArrayRevolverScheduler$anon$8.nextTick(Scheduler.scala:423)
> > > > > > > >         at akka.actor.LightArrayRevolverScheduler$anon$8.run(Scheduler.scala:375)
> > > > > > > >         at java.lang.Thread.run(Thread.java:722)
> > > > > > > >
> > > > > > > > At first I thought, okay, maybe wally004 is down; then I
> > > > > > > > ssh'd into it. Works fine.
> > > > > > > >
> > > > > > > > The full output can be found here:
> > > > > > > > https://gist.github.com/andralungu/d222b75cb33aea57955d
> > > > > > > >
> > > > > > > > Does anyone have any idea about what may have triggered
> > > > > > > > this? :(
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > Andra
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
