From user-return-32324-archive-asf-public=cust-asf.ponee.io@flink.apache.org  Fri Jan 31 02:22:09 2020
Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:user-help@flink.apache.org>
List-Unsubscribe: <mailto:user-unsubscribe@flink.apache.org>
List-Post: <mailto:user@flink.apache.org>
List-Id: <user.flink.apache.org>
Delivered-To: mailing list user@flink.apache.org
MIME-Version: 1.0
References: <CAEjS3trwkrMPbZque9-8QBLnRVCEc91U41SjP4Q-q7ex2yA-ig@mail.gmail.com>
 <CY4PR20MB122369851A8BFD137E30467CDA040@CY4PR20MB1223.namprd20.prod.outlook.com>
 <CAEjS3to+XFdmzc4gw13+fdxfZ3d+bH0B6NHOp+M2DtjyVLCA-A@mail.gmail.com>
In-Reply-To: <CAEjS3to+XFdmzc4gw13+fdxfZ3d+bH0B6NHOp+M2DtjyVLCA-A@mail.gmail.com>
From: Yang Wang <danrtsey.wy@gmail.com>
Date: Fri, 31 Jan 2020 10:21:51 +0800
Message-ID: <CAP+gf37AnsjiFykE8HPW-JpoRjUXCeQg=qJDBa4mObMBTvauSg@mail.gmail.com>
Subject: Re: Task-manager kubernetes pods take a long time to terminate
To: Li Peng <li.peng@doordash.com>
Cc: Yun Tang <myasuka@live.com>, user <user@flink.apache.org>
Content-Type: multipart/alternative; boundary="000000000000350e9e059d663d1a"

--000000000000350e9e059d663d1a
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

To delete your Flink cluster on K8s, I think you need to directly delete
all the created deployments (the jobmanager deployment and the taskmanager
deployment). You can leave the configmap and service in place if you want
to reuse them for the next Flink cluster deployment.
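
As a rough sketch of what I mean (the deployment names job-manager and
task-manager used below are hypothetical, so substitute whatever your
chart actually creates), a small shell helper could look like this:

```shell
# Delete the Flink deployments passed as arguments in one kubectl call.
# The deployment names are supplied by the caller; none are assumed here.
delete_flink_deployments() {
  kubectl delete deployment "$@"
}
```

You would call it as: delete_flink_deployments job-manager task-manager
The configmap and service are not touched by this.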

What is the status of the taskmanager pod when you delete it and it gets
stuck?
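
On the 5-minute registration timeout mentioned below: I believe the option
linked as [2] is taskmanager.registration.timeout, so lowering it may make
orphaned task managers exit sooner. A minimal sketch, assuming that key
name, a value format like "30 s", and that the key is not already present
in your flink-conf.yaml (the helper name is hypothetical):

```shell
# Append a registration timeout entry to a Flink configuration file.
# $1: path to flink-conf.yaml (depends on your image or configmap)
# $2: timeout value, for example "30 s"
set_registration_timeout() {
  printf 'taskmanager.registration.timeout: %s\n' "$2" >> "$1"
}
```

In a Kubernetes setup the file usually comes from the configmap, so the
same line could simply be added there instead.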


Best,
Yang

Li Peng <li.peng@doordash.com> wrote on Fri, Jan 31, 2020 at 4:51 AM:

> Hi Yun,
>
> I'm currently specifying that specific RPC address in my Kubernetes
> charts for convenience. Should I be generating a new one for every
> deployment?
>
> And yes, I am deleting the pods using those commands. I'm just noticing
> that the task-manager termination process is held up by the registration
> timeout check, so that instead of terminating quickly, the task-manager
> waits 5 minutes for the timeout before terminating. I'm expecting it to
> just terminate without going through that registration timeout. Is there
> a way to configure that?
>
> Thanks,
> Li
>
>
> On Thu, Jan 30, 2020 at 8:53 AM Yun Tang <myasuka@live.com> wrote:
>
>> Hi Li
>>
>> Why do you still use 'job-manager' as the jobmanager.rpc.address for the
>> second, new cluster? If you use another RPC address, the previous task
>> managers would not try to register with the new cluster.
>>
>> Take the Flink documentation for K8s [1] as an example. You can list or
>> delete all pods like this:
>>
>> kubectl get pods -l app=3Dflink
>> kubectl delete pods -l app=3Dflink
>>
>>
>> By the way, the default registration timeout is 5 minutes [2]. Task
>> managers that cannot register with the JM will shut themselves down
>> after 5 minutes.
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/ku=
bernetes.html#session-cluster-resource-definitions
>> [2]
>> https://github.com/apache/flink/blob/7e1a0f446e018681cb537dd936ae54388b5=
a7523/flink-core/src/main/java/org/apache/flink/configuration/TaskManagerOp=
tions.java#L158
>>
>> Best
>> Yun Tang
>>
>> ------------------------------
>> *From:* Li Peng <li.peng@doordash.com>
>> *Sent:* Thursday, January 30, 2020 9:24
>> *To:* user <user@flink.apache.org>
>> *Subject:* Task-manager kubernetes pods take a long time to terminate
>>
>> Hey folks, I'm deploying a Flink cluster via Kubernetes, and starting
>> each task manager with taskmanager.sh. I noticed that when I tell
>> kubectl to delete the deployment, the job-manager pod usually terminates
>> very quickly, but any task-manager that doesn't get terminated before
>> the job-manager usually gets stuck in this loop:
>>
>> 2020-01-29 09:18:47,867 INFO
>>  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could n=
ot
>> resolve ResourceManager address akka.tcp://flink@job-manager:6123/user/r=
esourcemanager,
>> retrying in 10000 ms: Could not connect to rpc endpoint under address
>> akka.tcp://flink@job-manager:6123/user/resourcemanager
>>
>> It then does this for about 10 minutes(?), and then shuts down. If I'm
>> deploying a new cluster, this pod will try to register itself with the
>> new job manager before terminating later. This isn't a troubling issue
>> as far as I can tell, but I find it annoying that I sometimes have to
>> force delete the pods.
>>
>> Is there an easy way to have the task managers terminate gracefully and
>> quickly?
>>
>> Thanks,
>> Li
>>
>

--000000000000350e9e059d663d1a--