MIME-Version: 1.0
References: <CAMq=OU5413_-bRbtp2+=4dTOf3tRjk_Ssjhbrdy=UcOpsMnz_w@mail.gmail.com>
 <CAC2R2971ej05cuqy-RdkD0scXt_aEunuOL3yDMQ6ZHkAEpSswg@mail.gmail.com> <CAMq=OU7BOkX7rNvkrSWt-6k_TWsFoWKEbU8HyAF4N_KC=QxbUA@mail.gmail.com>
In-Reply-To: <CAMq=OU7BOkX7rNvkrSWt-6k_TWsFoWKEbU8HyAF4N_KC=QxbUA@mail.gmail.com>
From: Juho Autio <juho.autio@rovio.com>
Date: Mon, 13 Aug 2018 10:52:04 +0300
Message-ID: <CAMJEyBaMa_P0JeYDMj72zP6nRSXjL6OUEBwqSLcDtbB+BMRH0g@mail.gmail.com>
Subject: Re: 1.5.1
To: vishal.santoshi@gmail.com
Cc: Gary Yao <gary@data-artisans.com>, user <user@flink.apache.org>
Content-Type: multipart/alternative; boundary="0000000000002f202905734c5fe1"

--0000000000002f202905734c5fe1
Content-Type: text/plain; charset="UTF-8"

I also have jobs failing on a daily basis with the error "Heartbeat of
TaskManager with id <id> timed out". I'm using Flink 1.5.2.

Could anyone suggest how to debug possible causes?

I already set these in flink-conf.yaml, but I'm still getting failures:
heartbeat.interval: 10000
heartbeat.timeout: 100000
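
To spell out what those two values amount to, here is a minimal Python sketch.
It assumes both settings are in milliseconds (consistent with the 50s default
timeout mentioned below in this thread). The helper names parse_flat_conf and
heartbeat_summary are hypothetical and not part of Flink, and the check that
the timeout exceeds the interval is only my own sanity check.

```python
def parse_flat_conf(text):
    """Parse flat 'key: value' lines, skipping blanks and '#' comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def heartbeat_summary(conf):
    """Convert the heartbeat settings (assumed to be in ms) to seconds."""
    interval_ms = int(conf["heartbeat.interval"])
    timeout_ms = int(conf["heartbeat.timeout"])
    # Sanity check of my own, not taken from Flink: a timeout shorter than
    # the interval would expire before the next heartbeat could arrive.
    if timeout_ms <= interval_ms:
        raise ValueError("heartbeat.timeout should exceed heartbeat.interval")
    return {
        "interval_s": interval_ms / 1000,
        "timeout_s": timeout_ms / 1000,
        # How many heartbeat intervals fit into one timeout window.
        "intervals_per_timeout": timeout_ms // interval_ms,
    }


# The two lines from my flink-conf.yaml, as quoted above.
CONF_TEXT = """
heartbeat.interval: 10000
heartbeat.timeout: 100000
"""

summary = heartbeat_summary(parse_flat_conf(CONF_TEXT))
print(summary)
```

So a heartbeat is requested every 10s and the timeout only fires after 100s,
which is why I suspect the TaskManager itself is stalling rather than the
settings being too tight.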

Thanks.

On Sun, Jul 22, 2018 at 2:20 PM Vishal Santoshi <vishal.santoshi@gmail.com>
wrote:

> According to the UI, it seems that
>
> org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>
> was the cause of a pipeline restart.
>
> As for the TM, it is an artifact of the new job allocation regime, which
> exhausts all slots on a TM rather than distributing them evenly, so I think
> some TMs are under more stress than in a pure round-robin (RR) distribution.
> We may have to lower the number of slots on each TM to define a good upper
> bound. You are correct, 50s is a pretty generous value.
>
> On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao <gary@data-artisans.com> wrote:
>
>> Hi,
>>
>> The first exception should only be logged at info level. It is expected
>> when a TaskManager unregisters from the ResourceManager.
>>
>> Heartbeats can be configured via heartbeat.interval and heartbeat.timeout
>> [1]. The default timeout is 50s, which should be a generous value. It is
>> probably a good idea to find out why the TM cannot answer the heartbeats.
>>
>> Best,
>> Gary
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager
>>
>>
>> On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <
>> vishal.santoshi@gmail.com> wrote:
>>
>>> We are seeing 2 issues on 1.5.1 in a streaming pipeline:
>>>
>>> org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>>
>>>
>>> and
>>>
>>> java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 208af709ef7be2d2dfc028ba3bbf4600 timed out.
>>>
>>>
>>> Not sure about the first, but how do we increase the heartbeat interval
>>> of a TM?
>>>
>>> Thanks much
>>>
>>> Vishal
>>>
>>
>>
>

--0000000000002f202905734c5fe1--