flink-dev mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: Heartbeat lost
Date Tue, 18 Nov 2014 16:27:17 GMT
Have you considered adopting Reactor instead of Akka?
On Nov 18, 2014 10:57 AM, "Stephan Ewen" <sewen@apache.org> wrote:

> Yes, that sounds like a good idea.
>
> I have seen that occasionally before, under high parallelism and with
> algorithms where the task manager suffered long garbage collection stalls...
>
> The default timeout (30 seconds) can be aggressive for such jobs...
>
> Stephan
> On 18.11.2014 09:47, "Kruse, Sebastian" <Sebastian.Kruse@hpi.de> wrote:
>
> > Hi everyone,
> >
> > In some of my jobs, I occasionally encounter the problem that some of the
> > task managers lose the heartbeat connection to the job manager. The
> > job manager did not crash, though. Here is an excerpt from the dashboard:
> >
> > Error: java.lang.Exception: TaskManager lost heartbeat connection to JobManager
> >     at org.apache.flink.runtime.taskmanager.TaskManager.registerAndRunHeartbeatLoop(TaskManager.java:847)
> >     at org.apache.flink.runtime.taskmanager.TaskManager.access$000(TaskManager.java:109)
> >     at org.apache.flink.runtime.taskmanager.TaskManager$1.run(TaskManager.java:365)
> >
> > I am not sure if this is a bug. I suspect that the network or job manager
> > load is too high, so that the heartbeats do not arrive (on time), but
> > that is a mere guess. A first step for me could be to increase the
> > heartbeat interval.
> >
> > Has any of you encountered this problem, or do you have any ideas on how
> > to avoid this issue?
> >
> > Thanks,
> > Sebastian
> >
>
