flink-dev mailing list archives

From "Kruse, Sebastian" <Sebastian.Kr...@hpi.de>
Subject Heartbeat lost
Date Tue, 18 Nov 2014 08:46:31 GMT
Hi everyone,

In some of my jobs, I occasionally encounter the problem that some of the task managers lose
the heartbeat connection to the job manager. The job manager did not crash, though. Here is an
excerpt from the dashboard:

Error: java.lang.Exception: TaskManager lost heartbeat connection to JobManager
    at org.apache.flink.runtime.taskmanager.TaskManager.registerAndRunHeartbeatLoop(TaskManager.java:847)
    at org.apache.flink.runtime.taskmanager.TaskManager.access$000(TaskManager.java:109)
    at org.apache.flink.runtime.taskmanager.TaskManager$1.run(TaskManager.java:365)

I am not sure whether this is a bug. I rather suspect that the load on the network or on the
job manager is too high, so that the heartbeats do not arrive (on time), but that is only a guess.
As a first step, I could increase the heartbeat interval.

Has any of you encountered this problem, or do you have any ideas on how to avoid it?

Thanks,
Sebastian
