From user-return-29735-archive-asf-public=cust-asf.ponee.io@flink.apache.org  Wed Sep 11 09:11:19 2019
MIME-Version: 1.0
References: <CAKvbZNVUvaZgQo8EZv_dksriwPto2Mk4s5d=9XWqR=f2LVD2JQ@mail.gmail.com>
 <CAK-Ni0doasnrPFa75Ud6XFbjksKug4HCtkNH1K2hZLCUjcHSgg@mail.gmail.com>
 <CAKvbZNXcLrdO9wG=cJX+EYfxhn5pgAO5PwtZ5B-xvFDeVxmNkg@mail.gmail.com>
 <CAJNyZN5PLzYt8A36BT+3eaQ1DEY7h2afc49bCzLDDyAU6hszhg@mail.gmail.com>
 <CAKvbZNVAXMSagGH3eD-DLTWLiWo1TrzCo6FUpnqfbJU1P85Q6Q@mail.gmail.com>
 <CAK-Ni0fXSDcrPDQ2BGgZU2K2pAiw4rhgErDG07B58rHEfuCZdw@mail.gmail.com>
 <CAKvbZNVn5QUy9j89X6yqSPrmCEgtdbqB7eBnJVanEaG=61kQYQ@mail.gmail.com> <CAJNyZN6_6hf-ev1=1hFuGtsaEFPgWKmGqc5hcCu8isG6etDq_A@mail.gmail.com>
In-Reply-To: <CAJNyZN6_6hf-ev1=1hFuGtsaEFPgWKmGqc5hcCu8isG6etDq_A@mail.gmail.com>
From: Anyang Hu <huanyang1024@gmail.com>
Date: Wed, 11 Sep 2019 17:11:02 +0800
Message-ID: <CAKvbZNWGoWEAkxGKrYospKVt2QK_dffXpTZawxBgu7ntu4ANNA@mail.gmail.com>
Subject: Re: suggestion of FLINK-10868
To: Till Rohrmann <trohrmann@apache.org>
Cc: Peter Huang <huangzhenqiu0825@gmail.com>, user <user@flink.apache.org>, 
	qi luo <luoqi.bd@gmail.com>, snake.fly318@gmail.com
Content-Type: multipart/alternative; boundary="000000000000185cfb05924367e2"

--000000000000185cfb05924367e2
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Till,

Some of our online batch jobs have strict SLA requirements and must not be
stuck for a long time, so we took a crude approach that makes the job exit
immediately. Waiting for the connection to recover is the better solution.
Maybe we should add a timeout that bounds how long we wait for the JM to
restore the connection?
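
To make the timeout idea concrete, here is a minimal sketch. It assumes the
RM buffers notifications it cannot deliver, replays them once the JM
reconnects, and only fails fatally after a configurable timeout. The class
PendingFailureNotifications, its methods, and the use of plain strings for
allocation IDs are hypothetical and not part of Flink:

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical helper, not a Flink class: keeps allocation-failure
// notifications while the JobManager (JM) is disconnected.
final class PendingFailureNotifications {
    private final Duration reconnectTimeout;
    private final Queue<String> pending = new ArrayDeque<>();
    private long disconnectedSinceMillis = -1L; // -1 means "JM connected"

    PendingFailureNotifications(Duration reconnectTimeout) {
        this.reconnectTimeout = reconnectTimeout;
    }

    // Called when the RM notices that the JM connection is gone.
    void onJobManagerDisconnected(long nowMillis) {
        if (disconnectedSinceMillis < 0) {
            disconnectedSinceMillis = nowMillis;
        }
    }

    // Called instead of dropping a notification that cannot be delivered.
    void buffer(String allocationId) {
        pending.add(allocationId);
    }

    // Called when the JM registers again: replay everything in order.
    void onJobManagerReconnected(Consumer<String> jobManagerGateway) {
        disconnectedSinceMillis = -1L;
        while (!pending.isEmpty()) {
            jobManagerGateway.accept(pending.poll());
        }
    }

    // True only if undelivered notifications exist and the JM has been
    // disconnected for longer than the timeout.
    boolean shouldFailFatally(long nowMillis) {
        return disconnectedSinceMillis >= 0
            && !pending.isEmpty()
            && nowMillis - disconnectedSinceMillis > reconnectTimeout.toMillis();
    }
}
```

The RM would call buffer(...) at the point where the notification is lost
today and poll shouldFailFatally(...) from a periodic timer, so short
connection problems are tolerated and only a long JM outage ends the job.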

As for suggestion 1 (making the interval configurable), we have already
implemented it and, if possible, would like to contribute it back to the
community.
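
For illustration only, a configurable interval could look roughly like the
sketch below. It assumes that "interval" means the time window over which
failed containers are counted; FailureRateTracker and its parameters are
hypothetical names and do not correspond to our patch or to Flink code:

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not Flink code: counts container failures inside a
// sliding window whose length (the "interval") is configurable.
final class FailureRateTracker {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateTracker(int maxFailuresPerInterval, Duration interval) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = interval.toMillis();
    }

    // Records one failure and reports whether the limit is now exceeded.
    boolean recordFailureAndCheck(long nowMillis) {
        failureTimestamps.addLast(nowMillis);
        // Drop failures that have fallen out of the window.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() >= intervalMillis) {
            failureTimestamps.removeFirst();
        }
        return failureTimestamps.size() > maxFailuresPerInterval;
    }
}
```

With a limit of 2 failures per 60 seconds, a third failure inside the same
minute exceeds the limit, while failures spread over several minutes do not.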

Best regards,
Anyang

On Mon, Sep 9, 2019 at 3:09 PM Till Rohrmann <trohrmann@apache.org> wrote:

> Hi Anyang,
>
> I think we cannot take your proposal because it means that whenever we
> want to call notifyAllocationFailure while there is a connection problem
> between the RM and the JM, we fail the whole cluster. This is something a
> robust and resilient system should not do, because connection problems are
> expected and need to be handled gracefully. Instead, if one deems the
> notifyAllocationFailure message to be very important, then one would need
> to keep it and deliver it to the JM once it has reconnected.
>
> Cheers,
> Till
>
> On Sun, Sep 8, 2019 at 11:26 AM Anyang Hu <huanyang1024@gmail.com> wrote:
>
>> Hi Peter,
>>
>> For our online batch jobs, there is a scenario where the container failure
>> rate reaches MAXIMUM_WORKERS_FAILURE_RATE but the client does not exit
>> immediately (the probability of losing the JM increases greatly when
>> thousands of containers have to be started). We found that a JM
>> disconnection (the reason for the JM loss is unknown) prevents
>> notifyAllocationFailure from taking effect.
>>
>> After FLINK-13184 <https://jira.apache.org/jira/browse/FLINK-13184>
>> introduced multi-threaded container startup, the JM disconnection problem
>> has been alleviated. To make the client exit immediately in a reliable way,
>> we use the following code to decide whether to call onFatalError when a
>> MaximumFailedTaskManagerExceedingException occurs:
>>
>> @Override
>> public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
>>    validateRunsInMainThread();
>>
>>    JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
>>    if (jobManagerRegistration != null) {
>>       // The JM is still registered: forward the failure as usual.
>>       jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
>>    } else if (exitProcessOnJobManagerTimedout) {
>>       // The JM is lost and the notification cannot be delivered: fail the process.
>>       ResourceManagerException exception = new ResourceManagerException("Job Manager is lost, can not notify allocation failure.");
>>       onFatalError(exception);
>>    }
>> }
>>
>>
>> Best regards,
>>
>> Anyang
>>
>>

--000000000000185cfb05924367e2--