Date: Wed, 19 Nov 2014 15:40:36 +0100
Subject: Re: Heartbeat lost
From: Stephan Ewen
To: dev@flink.incubator.apache.org

The mechanisms are different here:

The JobManager cares about time and discards a TaskManager if the heartbeat was delayed long enough.

Delayed heartbeats are not a problem for the TaskManager: if the heartbeat thread gets stuck, it gets stuck. Only seriously lost heartbeats cause a problem, and that goes together with an IOException. The only other reason for an unsuccessful heartbeat is that the JobManager rejected the heartbeat because the delay has passed and the TaskManager has been marked as dead.

In that sense, the TaskManager respects the delay as well, unless network problems occur. In that case, it fails earlier.

Do you actually experience these IOExceptions (in the log of the TaskManager)?

On Wed, Nov 19, 2014 at 2:49 PM, Kruse, Sebastian wrote:

> To me, it looks like the
> "jobmanager.max-heartbeat-delay-before-failure.sec" is only used by the
> jobmanager to determine dead taskmanagers, but not vice versa.
> This is probably fine, because the parameter starts with "jobmanager".
> However, the number of missed heartbeats from the jobmanager to the
> taskmanager seems to be hard-wired to 3:
>
> TaskManager, ll.335ff.:
>
>     // start the heart beats
>     {
>         final long interval = GlobalConfiguration.getInteger(
>                 ConfigConstants.TASK_MANAGER_HEARTBEAT_INTERVAL_KEY,
>                 ConfigConstants.DEFAULT_TASK_MANAGER_HEARTBEAT_INTERVAL);
>
>         this.heartbeatThread = new Thread() {
>             @Override
>             public void run() {
>                 registerAndRunHeartbeatLoop(interval, MAX_LOST_HEART_BEATS);
>             }
>         };
>         this.heartbeatThread.setName("Heartbeat Thread");
>         this.heartbeatThread.start();
>     }
>
> Maybe we should have a
> "taskmanager.max-heartbeat-delay-before-failure.msec" as well.
>
> -----Original Message-----
> From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On Behalf Of
> Stephan Ewen
> Sent: Tuesday, 18 November 2014 14:08
> To: dev@flink.incubator.apache.org
> Subject: Re: Heartbeat lost
>
> The heartbeats currently go through the RPC service, which is soon to be
> replaced by akka. So any fix there would be temporary.
>
> You can try increasing the thread priority; let us know if it works.
>
> Otherwise you can increase the heartbeat timeout via
> "jobmanager.max-heartbeat-delay-before-failure.sec". CAREFUL: The key says
> seconds, but the value is in milliseconds. We actually need to fix that.
>
> Stephan
>
>
> On Tue, Nov 18, 2014 at 1:25 PM, Kruse, Sebastian wrote:
>
> > I am using the RemoteCollectorOutputFormat (if you recall, Fabian
> > Tschirschnitz contributed this) to send the output data to the driver,
> > which happens to run on the same machine as the jobmanager. In some
> > cases, this output becomes huge; I assume this to be the problem.
> >
> > However, since the heartbeat runs in its own thread, we could assign
> > it a higher priority than regular driver/jobmanager code, to avoid the
> > suppression of heartbeats. Or am I missing something?
> >
> > Cheers,
> > Sebastian
> >
> > -----Original Message-----
> > From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On Behalf
> > Of Stephan Ewen
> > Sent: Tuesday, 18 November 2014 10:57
> > To: dev@flink.incubator.apache.org
> > Subject: Re: Heartbeat lost
> >
> > Yes, that sounds like a good idea.
> >
> > I have experienced that occasionally before, under high parallelism
> > and algorithms where the task manager got long garbage collection
> > stalls...
> >
> > The default timeout (30 seconds) can be aggressive for such jobs...
> >
> > Stephan
> >
> > On 18.11.2014 09:47, "Kruse, Sebastian" wrote:
> >
> > > Hi everyone,
> > >
> > > In some of my jobs, I occasionally encounter the problem that some
> > > of the task managers lose the heartbeat connection to the job manager.
> > > The jobmanager did not crash, though. Here is an excerpt from the
> > > dashboard:
> > >
> > > Error: java.lang.Exception: TaskManager lost heartbeat connection to JobManager
> > >     at org.apache.flink.runtime.taskmanager.TaskManager.registerAndRunHeartbeatLoop(TaskManager.java:847)
> > >     at org.apache.flink.runtime.taskmanager.TaskManager.access$000(TaskManager.java:109)
> > >     at org.apache.flink.runtime.taskmanager.TaskManager$1.run(TaskManager.java:365)
> > >
> > > I am not sure if this is a bug. I rather figure that the network or
> > > jobmanager workload is too high, so that somehow the heartbeats do
> > > not arrive (on time), but that's a mere guess. A first step for me
> > > could be to increase the heartbeat interval.
> > >
> > > Has anyone encountered this problem, or do you have any ideas
> > > on how to avoid this issue?
> > >
> > > Thanks,
> > > Sebastian
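
For readers of the archive, the two failure paths discussed in this thread can be sketched roughly as below. This is only an illustration under stated assumptions, not the Flink code: the class HeartbeatSketch, its nested types, and all method names are hypothetical. It also assumes that the lost-heartbeat counter resets after a successful heartbeat and that the TaskManager gives up on the third consecutive IOException; the thread only says the limit "seems to be hard-wired to 3". The 30000 ms value in the usage note mirrors the 30 second default mentioned above, given in milliseconds as the config value expects.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the two heartbeat mechanisms; not Flink's classes. */
class HeartbeatSketch {

    /** Possible results of one heartbeat attempt, as seen by the TaskManager. */
    public enum Outcome { OK, IO_ERROR, REJECTED }

    /** JobManager side: purely time-based. */
    public static final class JobManagerSide {
        private final long maxDelayMillis;
        private final Map<String, Long> lastSeen = new HashMap<>();

        public JobManagerSide(long maxDelayMillis) {
            this.maxDelayMillis = maxDelayMillis;
        }

        /**
         * Returns false if the previous heartbeat is older than the allowed
         * delay. The timestamp is not refreshed in that case, so the
         * TaskManager stays marked as dead.
         */
        public boolean heartbeat(String taskManagerId, long nowMillis) {
            Long last = lastSeen.get(taskManagerId);
            if (last != null && nowMillis - last > maxDelayMillis) {
                return false;
            }
            lastSeen.put(taskManagerId, nowMillis);
            return true;
        }
    }

    /** TaskManager side: counts lost heartbeats, ignores mere delays. */
    public static final class TaskManagerSide {
        private final int maxLostHeartbeats;
        private int consecutiveLost = 0;

        public TaskManagerSide(int maxLostHeartbeats) {
            this.maxLostHeartbeats = maxLostHeartbeats;
        }

        /** Returns true while the TaskManager should keep running. */
        public boolean onHeartbeatResult(Outcome outcome) {
            switch (outcome) {
                case OK:
                    consecutiveLost = 0; // assumption: success resets the counter
                    return true;
                case IO_ERROR:
                    consecutiveLost++;   // assumption: fail once the limit is reached
                    return consecutiveLost < maxLostHeartbeats;
                default:
                    return false;        // REJECTED: JobManager already marked us dead
            }
        }
    }
}
```

With new HeartbeatSketch.JobManagerSide(30000L), a heartbeat that arrives 29 seconds after the previous one is accepted, while one that arrives 31 seconds later is rejected, which is the time-based path. On the other side, new HeartbeatSketch.TaskManagerSide(3) keeps running through delays of any length and stops only on a rejection or on the third IO_ERROR in a row, which is why network problems make it fail earlier than the configured delay.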