MIME-Version: 1.0
Received: by 2002:ac0:92e2:0:0:0:0:0 with HTTP; Tue, 14 Aug 2018 13:24:47
 -0700 (PDT)
In-Reply-To: <CAMJEyBaMa_P0JeYDMj72zP6nRSXjL6OUEBwqSLcDtbB+BMRH0g@mail.gmail.com>
References: <CAMq=OU5413_-bRbtp2+=4dTOf3tRjk_Ssjhbrdy=UcOpsMnz_w@mail.gmail.com>
 <CAC2R2971ej05cuqy-RdkD0scXt_aEunuOL3yDMQ6ZHkAEpSswg@mail.gmail.com>
 <CAMq=OU7BOkX7rNvkrSWt-6k_TWsFoWKEbU8HyAF4N_KC=QxbUA@mail.gmail.com> <CAMJEyBaMa_P0JeYDMj72zP6nRSXjL6OUEBwqSLcDtbB+BMRH0g@mail.gmail.com>
From: Gary Yao <gary@data-artisans.com>
Date: Tue, 14 Aug 2018 22:24:47 +0200
Message-ID: <CAC2R296LdHnWQkNKNw2mQoTxdR_eOMWWaaxftWYNAy8e4zPj0Q@mail.gmail.com>
Subject: Re: 1.5.1
To: Juho Autio <juho.autio@rovio.com>
Cc: user <user@flink.apache.org>
Content-Type: multipart/alternative; boundary="00000000000060f56a05736b007a"

--00000000000060f56a05736b007a
Content-Type: text/plain; charset="UTF-8"

Hi Juho,

It seems that in your case the JobMaster did not receive a heartbeat from the
TaskManager in time [1]. Heartbeat requests and answers are sent over the RPC
framework, and the RPCs of one component (e.g., TaskManager, JobMaster) are
dispatched by a single thread. Therefore, possible reasons for heartbeat
timeouts include:

    1. The RPC threads of the TM or JM are blocked. In this case heartbeat
       requests or answers cannot be dispatched.
    2. The scheduled task for sending the heartbeat requests [2] died.
    3. The network is flaky.
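
To illustrate the first cause, here is a minimal sketch in plain Java. It is
not Flink code: the class and method names are made up for illustration, and a
single-threaded ExecutorService stands in for the one thread that dispatches
the RPCs of a component. It only shows that a heartbeat queued behind a
blocking call is not handled until that call returns.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; none of these names exist in Flink.
class SingleThreadHeartbeatSketch {

    /** Returns how many milliseconds a "heartbeat" waited behind a blocking call. */
    static long heartbeatDelayMillis(long blockingCallMillis) {
        // Stand-in for the single thread that dispatches all RPCs of one component.
        ExecutorService rpcThread = Executors.newSingleThreadExecutor();
        try {
            // A long-running handler occupies the only dispatcher thread.
            rpcThread.submit(() -> {
                try {
                    Thread.sleep(blockingCallMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // The heartbeat is enqueued right away but can only run afterwards.
            long enqueuedAt = System.nanoTime();
            Callable<Long> heartbeat = System::nanoTime;
            Future<Long> handledAt = rpcThread.submit(heartbeat);
            return TimeUnit.NANOSECONDS.toMillis(handledAt.get() - enqueuedAt);
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            rpcThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("Heartbeat handled after ~"
                + heartbeatDelayMillis(500) + " ms");
    }
}
```

If the blocking call lasts longer than heartbeat.timeout, the other side would
declare the heartbeat timed out even though the process is alive.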

If you are confident that the network is not the culprit, I suggest setting
the logging level to DEBUG and looking for periodic log messages (in the JM
and TM logs) that are related to heartbeating. If the periodic log messages
are overdue, it is a hint that the main thread of the RPC endpoint is blocked
somewhere.
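
One rough way to spot overdue messages is to compare the timestamps of
consecutive heartbeat-related log lines. The sketch below is an assumption-laden
helper, not part of Flink: it assumes each line starts with a timestamp in the
form "yyyy-MM-dd HH:mm:ss,SSS", that you have already filtered the log down to
the heartbeat lines, and the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the log layout is an assumption, adjust it to your setup.
class OverdueHeartbeatLogSketch {

    // Assumed timestamp prefix of each log line, e.g. "2018-08-14 20:24:53,123".
    private static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");
    private static final int TIMESTAMP_LENGTH = 23;

    /** Returns one entry per pair of consecutive lines that are further apart than maxGap. */
    static List<String> overdueGaps(List<String> heartbeatLogLines, Duration maxGap) {
        List<String> gaps = new ArrayList<>();
        LocalDateTime previous = null;
        for (String line : heartbeatLogLines) {
            LocalDateTime current =
                    LocalDateTime.parse(line.substring(0, TIMESTAMP_LENGTH), TIMESTAMP);
            if (previous != null
                    && Duration.between(previous, current).compareTo(maxGap) > 0) {
                gaps.add(previous + " -> " + current);
            }
            previous = current;
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Placeholder lines; the message text is not a real Flink log message.
        List<String> lines = Arrays.asList(
                "2018-08-14 20:24:00,000 DEBUG <heartbeat message>",
                "2018-08-14 20:24:10,000 DEBUG <heartbeat message>",
                "2018-08-14 20:24:45,000 DEBUG <heartbeat message>");
        // With heartbeat.interval at 10s, allow some slack before flagging a gap.
        overdueGaps(lines, Duration.ofSeconds(15)).forEach(System.out::println);
    }
}
```

A flagged gap only tells you when the endpoint stopped processing messages; a
thread dump taken during such a gap would show what the main thread was doing.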

Best,
Gary

[1] https://github.com/apache/flink/blob/release-1.5.2/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1611
[2] https://github.com/apache/flink/blob/913b0413882939c30da4ad4df0cabc84dfe69ea0/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerSenderImpl.java#L64

On Mon, Aug 13, 2018 at 9:52 AM, Juho Autio <juho.autio@rovio.com> wrote:

> I also have jobs failing on a daily basis with the error "Heartbeat of
> TaskManager with id <id> timed out". I'm using Flink 1.5.2.
>
> Could anyone suggest how to debug possible causes?
>
> I already set these in flink-conf.yaml, but I'm still getting failures:
> heartbeat.interval: 10000
> heartbeat.timeout: 100000
>
> Thanks.
>
> On Sun, Jul 22, 2018 at 2:20 PM Vishal Santoshi <vishal.santoshi@gmail.com>
> wrote:
>
>> According to the UI it seems that "
>>
>> org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>
>> " was the cause of a pipe restart.
>>
>> As to the TM, it is an artifact of the new job allocation regime, which
>> will exhaust all slots on a TM rather than distributing them equitably.
>> Some TMs are under more stress than in a pure RR distribution, I
>> think. We may have to lower the slots on each TM to define a good upper
>> bound. You are correct, 50s is a pretty generous value.
>>
>> On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao <gary@data-artisans.com> wrote:
>>
>>> Hi,
>>>
>>> The first exception should only be logged at INFO level. It is expected
>>> to see this exception when a TaskManager unregisters from the
>>> ResourceManager.
>>>
>>> Heartbeats can be configured via heartbeat.interval and heartbeat.timeout
>>> [1]. The default timeout is 50s, which should be a generous value. It is
>>> probably a good idea to find out why the heartbeats cannot be answered by
>>> the TM.
>>>
>>> Best,
>>> Gary
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager
>>>
>>>
>>> On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <
>>> vishal.santoshi@gmail.com> wrote:
>>>
>>>> Two issues we are seeing on 1.5.1 on a streaming pipeline:
>>>>
>>>> org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>>>
>>>>
>>>> and
>>>>
>>>> java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 208af709ef7be2d2dfc028ba3bbf4600 timed out.
>>>>
>>>>
>>>> Not sure about the first, but how do we increase the heartbeat interval
>>>> of a TM?
>>>>
>>>> Thanks much
>>>>
>>>> Vishal
>>>>
>>>
>>>
>>
>

--00000000000060f56a05736b007a--