From: Stephan Ewen
To: dev@flink.incubator.apache.org
Subject: Re: Heartbeat lost
Date: Tue, 18 Nov 2014 14:07:31 +0100

The heartbeats currently go through the RPC service, which is soon to be replaced by Akka, so any fix there would be temporary.

You can try increasing the thread priority; let us know if it works.

Otherwise, you can increase the heartbeat timeout via "jobmanager.max-heartbeat-delay-before-failure.sec". CAREFUL: The key says seconds, but the value is in milliseconds. We actually need to fix that.

Stephan

On Tue, Nov 18, 2014 at 1:25 PM, Kruse, Sebastian wrote:

> I am using the RemoteCollectorOutputFormat (if you recall, Fabian
> Tschirschnitz contributed this) to send the output data to the driver, which
> happens to run on the same machine as the jobmanager. In some cases, this
> output becomes huge; I assume this to be the problem.
>
> However, since the heartbeat runs in its own thread, we could assign it a
> higher priority than regular driver/jobmanager code to avoid the
> suppression of heartbeats. Or am I missing something?
>
> Cheers,
> Sebastian
>
> -----Original Message-----
> From: ewenstephan@gmail.com [mailto:ewenstephan@gmail.com] On Behalf Of
> Stephan Ewen
> Sent: Tuesday, 18 November 2014 10:57
> To: dev@flink.incubator.apache.org
> Subject: Re: Heartbeat lost
>
> Yes, that sounds like a good idea.
>
> I have experienced that occasionally before, under high parallelism and
> algorithms where the task manager got long garbage collection stalls...
>
> The default timeout (30 seconds) can be aggressive for such jobs...
>
> Stephan
> On 18.11.2014 09:47, "Kruse, Sebastian" wrote:
>
> > Hi everyone,
> >
> > In some of my jobs, I occasionally encounter the problem that some of
> > the task managers lose the heartbeat connection to the job manager.
> > The jobmanager did not crash, though. Here is an excerpt from the dashboard:
> >
> > Error: java.lang.Exception: TaskManager lost heartbeat connection to JobManager
> >     at org.apache.flink.runtime.taskmanager.TaskManager.registerAndRunHeartbeatLoop(TaskManager.java:847)
> >     at org.apache.flink.runtime.taskmanager.TaskManager.access$000(TaskManager.java:109)
> >     at org.apache.flink.runtime.taskmanager.TaskManager$1.run(TaskManager.java:365)
> >
> > I am not sure if this is a bug. I rather suspect that the network or
> > jobmanager workload is too high, so that the heartbeats do not
> > arrive (on time), but that's a mere guess. A first step for me could
> > be to increase the heartbeat interval.
> >
> > Has any of you encountered this problem, or do you have any ideas on
> > how to avoid this issue?
> >
> > Thanks,
> > Sebastian
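
The two workarounds discussed in this thread (raising the heartbeat timeout and giving the heartbeat thread a higher priority) can be illustrated with a small Java sketch. It is not Flink code: the class `HeartbeatSketch` and its methods are hypothetical names chosen for illustration, and the `key: value` rendering assumes a YAML-style configuration file. Only the configuration key and the seconds-versus-milliseconds caveat come from the thread itself.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; the class and method names are hypothetical, not Flink API. */
final class HeartbeatSketch {

    // Key quoted in the thread. Despite the ".sec" suffix, the value is read as milliseconds.
    static final String HEARTBEAT_TIMEOUT_KEY =
            "jobmanager.max-heartbeat-delay-before-failure.sec";

    /** Converts a timeout given in seconds into the millisecond value the key expects. */
    static long toConfigValue(long timeoutSeconds) {
        if (timeoutSeconds <= 0) {
            throw new IllegalArgumentException("timeout must be positive: " + timeoutSeconds);
        }
        return TimeUnit.SECONDS.toMillis(timeoutSeconds);
    }

    /** Renders a "key: value" line (assumes a YAML-style config file). */
    static String toConfigLine(long timeoutSeconds) {
        return HEARTBEAT_TIMEOUT_KEY + ": " + toConfigValue(timeoutSeconds);
    }

    /** Creates, but does not start, a daemon thread with the highest priority. */
    static Thread newHeartbeatThread(Runnable heartbeatLoop) {
        Thread thread = new Thread(heartbeatLoop, "heartbeat-sketch");
        thread.setDaemon(true);
        // Priority is only a hint to the scheduler, not a guarantee.
        thread.setPriority(Thread.MAX_PRIORITY);
        return thread;
    }

    public static void main(String[] args) {
        // A two-minute timeout, written in the unit the key actually uses.
        System.out.println(toConfigLine(120));

        Thread heartbeat = newHeartbeatThread(() -> { });
        System.out.println(heartbeat.getName() + " priority=" + heartbeat.getPriority());
    }
}
```

With this helper, a desired timeout of 120 seconds is written as the value 120000. Note that a higher thread priority is unlikely to help when the delay comes from stop-the-world garbage collection pauses, since those suspend all application threads regardless of priority; in that case the longer timeout is probably the more useful knob.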