Mailing-List: contact dev-help@flink.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@flink.incubator.apache.org
Received-SPF: pass (nike.apache.org: domain of ewenstephan@gmail.com
 designates 209.85.220.169 as permitted sender)
MIME-Version: 1.0
Sender: ewenstephan@gmail.com
In-Reply-To: 
 <979DB9666496EA4AA881B4E447D3040001B29BF3@MXMA2012.hpi.uni-potsdam.de>
References: 
 <979DB9666496EA4AA881B4E447D3040001B2846C@MXMA2012.hpi.uni-potsdam.de>
	<CANC1h_uoQHzZ09bGBrKg+pEiDvrz2PnMP=fk2LcTtzAyMbvXVw@mail.gmail.com>
	<979DB9666496EA4AA881B4E447D3040001B2890E@MXMA2012.hpi.uni-potsdam.de>
	<CANC1h_sG-KQdiePKHcT3v=1S5osN828XxuoTOe6iOnvxadfn=A@mail.gmail.com>
	<979DB9666496EA4AA881B4E447D3040001B29BF3@MXMA2012.hpi.uni-potsdam.de>
Date: Wed, 19 Nov 2014 15:40:36 +0100
Message-ID: 
 <CANC1h_ur_Ash_LHv_+mTkTz1=+SFQSAVAeWrnfc5kMBu1dZpOg@mail.gmail.com>
Subject: Re: Heartbeat lost
From: Stephan Ewen <sewen@apache.org>
To: dev@flink.incubator.apache.org
Content-Type: multipart/alternative; boundary=089e0160c0bedc7a0005083731b7

--089e0160c0bedc7a0005083731b7
Content-Type: text/plain; charset=UTF-8

The mechanisms are different here: the JobManager cares about time and discards
a TaskManager if its heartbeat has been delayed long enough.
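
To make the JobManager side concrete, here is a minimal sketch of a purely
time-based liveness check. It is not the actual JobManager code: the class
HeartbeatMonitor and all of its methods are hypothetical names made up for
illustration, and timestamps are passed in explicitly to keep it testable.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch of a time-based liveness check (not Flink's real code).
class HeartbeatMonitor {
    private final long maxDelayMillis;
    // Last time (in milliseconds) each registered worker was heard from.
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    HeartbeatMonitor(long maxDelayMillis) {
        this.maxDelayMillis = maxDelayMillis;
    }

    synchronized void register(String workerId, long nowMillis) {
        lastHeartbeat.put(workerId, nowMillis);
    }

    // Returns false if the worker is unknown, e.g. because it was already
    // discarded as dead. The caller sees this as a rejected heartbeat.
    synchronized boolean heartbeat(String workerId, long nowMillis) {
        if (!lastHeartbeat.containsKey(workerId)) {
            return false;
        }
        lastHeartbeat.put(workerId, nowMillis);
        return true;
    }

    // Discards every worker whose last heartbeat is older than the maximum
    // delay and returns how many were discarded.
    synchronized int discardDead(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() > maxDelayMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

The point of the sketch is that only elapsed time matters on this side; the
number of individual heartbeats that went missing is never counted.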

Delayed heartbeats are not a problem for the TaskManager: if the heartbeat
thread gets stuck, it simply gets stuck. Only heartbeats that are actually
lost cause a problem, and those come with an IOException. The only other
reason for an unsuccessful heartbeat is that the JobManager rejected it
because the maximum delay has passed and the TaskManager has already been
marked as dead.

In that sense, the TaskManager respects the delay as well, unless network
problems occur. In that case, it fails earlier.
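
The TaskManager side described above can be sketched roughly as follows. This
is an illustration under assumptions, not the real
registerAndRunHeartbeatLoop: HeartbeatTarget, HeartbeatLoop and Outcome are
hypothetical names, the maxRounds parameter exists only so that the sketch
terminates, and resetting the counter after a successful heartbeat is an
assumption as well.

```java
import java.io.IOException;

// Hypothetical stand-in for the RPC call to the JobManager.
interface HeartbeatTarget {
    // Returns false if the receiver rejects the heartbeat
    // (the sender has already been marked as dead).
    boolean sendHeartbeat() throws IOException;
}

class HeartbeatLoop {
    enum Outcome { REJECTED, CONNECTION_LOST, STOPPED }

    static Outcome run(HeartbeatTarget target, long intervalMillis,
                       int maxLostHeartbeats, int maxRounds) {
        int lost = 0;
        for (int round = 0; round < maxRounds; round++) {
            try {
                if (!target.sendHeartbeat()) {
                    // The receiver no longer knows us: fail immediately.
                    return Outcome.REJECTED;
                }
                lost = 0; // assumption: a success resets the counter
            } catch (IOException e) {
                // A lost heartbeat: give up once too many have failed.
                lost++;
                if (lost >= maxLostHeartbeats) {
                    return Outcome.CONNECTION_LOST;
                }
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Outcome.STOPPED;
            }
        }
        return Outcome.STOPPED;
    }
}
```

A slow or stalled loop never fails on its own in this sketch. It ends early
only on repeated IOExceptions or on a rejection by the receiver, which is the
distinction made above.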

Do you actually see these IOExceptions in the TaskManager log?


On Wed, Nov 19, 2014 at 2:49 PM, Kruse, Sebastian <Sebastian.Kruse@hpi.de>
wrote:

> To me, it looks like the
> "jobmanager.max-heartbeat-delay-before-failure.sec" is only used by the
> jobmanager to determine dead taskmanagers, but not vice versa. This is
> probably fine, because the parameter starts with "jobmanager". However, the
> number of missed heartbeats from the jobmanager to the taskmanager seems to
> be hard-wired to 3:
>
> TaskManager, ll.335ff.:
>
>     // start the heart beats
>     {
>         final long interval = GlobalConfiguration.getInteger(
>                 ConfigConstants.TASK_MANAGER_HEARTBEAT_INTERVAL_KEY,
>                 ConfigConstants.DEFAULT_TASK_MANAGER_HEARTBEAT_INTERVAL);
>
>         this.heartbeatThread = new Thread() {
>             @Override
>             public void run() {
>                 registerAndRunHeartbeatLoop(interval, MAX_LOST_HEART_BEATS);
>             }
>         };
>         this.heartbeatThread.setName("Heartbeat Thread");
>         this.heartbeatThread.start();
>     }
>
> Maybe we should have a
> "taskmanager.max-heartbeat-delay-before-failure.msec" as well.
>
> -----Original Message-----
> From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On Behalf Of
> Stephan Ewen
> Sent: Tuesday, 18 November 2014 14:08
> To: dev@flink.incubator.apache.org
> Subject: Re: Heartbeat lost
>
> The heartbeats currently go through the RPC service which is soon to be
> replaced by akka. So any fix there would be temporary.
>
> You can try increasing the thread priority. Let us know if it works.
>
> Otherwise you can increase the heartbeat timeout via
> "jobmanager.max-heartbeat-delay-before-failure.sec". CAREFUL: The key says
> seconds, but the value is in milliseconds. We actually need to fix that.
>
> Stephan
>
>
> On Tue, Nov 18, 2014 at 1:25 PM, Kruse, Sebastian <Sebastian.Kruse@hpi.de>
> wrote:
>
> > I am using the RemoteCollectorOutputFormat (if you recall, Fabian
> > Tschirschnitz contributed this) to send the output data to the driver
> > which happens to run on the same machine as the jobmanager. In some
> > cases, this output becomes huge; I assume this is the problem.
> >
> > However, since the heartbeat runs in its own thread, we could assign
> > it a higher priority than regular driver/jobmanager code, to avoid the
> > suppression of heartbeats. Or am I missing something?
> >
> > Cheers,
> > Sebastian
> >
> > -----Original Message-----
> > From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On Behalf
> > Of Stephan Ewen
> > Sent: Tuesday, 18 November 2014 10:57
> > To: dev@flink.incubator.apache.org
> > Subject: Re: Heartbeat lost
> >
> > Yes, that sounds like a good idea.
> >
> > I have experienced that occasionally before, under high parallelism
> > and algorithms where the task manager got long garbage collection
> > stalls...
> >
> > The default timeout (30 seconds) can be aggressive for such jobs...
> >
> > Stephan
> > On 18.11.2014 at 09:47, "Kruse, Sebastian" <Sebastian.Kruse@hpi.de> wrote:
> >
> > > Hi everyone,
> > >
> > > In some of my jobs, I occasionally encounter the problem that some
> > > of the task managers lose the heartbeat connection to the job manager.
> > > The jobmanager did not crash, though. Here is an excerpt from the
> > > dashboard:
> > >
> > > Error: java.lang.Exception: TaskManager lost heartbeat connection to JobManager
> > >     at org.apache.flink.runtime.taskmanager.TaskManager.registerAndRunHeartbeatLoop(TaskManager.java:847)
> > >     at org.apache.flink.runtime.taskmanager.TaskManager.access$000(TaskManager.java:109)
> > >     at org.apache.flink.runtime.taskmanager.TaskManager$1.run(TaskManager.java:365)
> > >
> > > I am not sure if this is a bug. I rather figure that the network or
> > > jobmanager workload is too high, so that somehow the heartbeats do
> > > not arrive (on time), but that's a mere guess. A first step for me
> > > could be to increase the heartbeat interval.
> > >
> > > Have any of you encountered this problem, or do you have any ideas
> > > on how to avoid this issue?
> > >
> > > Thanks,
> > > Sebastian
> > >
> >
>

--089e0160c0bedc7a0005083731b7--