Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
MIME-Version: 1.0
In-Reply-To: <9BF4FB69-90B5-4951-88F5-259206FC0ED4@expedia.com>
References: <CAMMKjwiCypnNzxsrDgFHnTX6E7bETLo3DyZ898yA7G1tfHr=oA@mail.gmail.com>
 <9BF4FB69-90B5-4951-88F5-259206FC0ED4@expedia.com>
From: Stephan Ewen <sewen@apache.org>
Date: Mon, 21 Aug 2017 18:31:35 +0200
Message-ID: <CANC1h_sd_UDNeHU4TqV1i10wndPEznquz7uReSo3rnM=BhjC3g@mail.gmail.com>
Subject: Re: Flink HA with Kubernetes, without Zookeeper
To: Shannon Carey <scarey@expedia.com>
Cc: Hao Sun <hasun@zendesk.com>, "user@flink.apache.org" <user@flink.apache.org>
Content-Type: multipart/alternative; boundary="f403043cfc240b08ac0557460484"
archived-at: Mon, 21 Aug 2017 16:31:55 -0000

--f403043cfc240b08ac0557460484
Content-Type: text/plain; charset="UTF-8"

Hi!

That is a very interesting proposition. In cases where you have a single
master only, you may get away with quite good guarantees without ZK. In
fact, Flink does not store significant data in ZK at all; it only uses
locks and counters.

You can have a setup without ZK, provided you have the following:

  - All processes restart (a lost JobManager restarts eventually).
Kubernetes should give you that.

  - A way for TaskManagers to discover the restarted JobManager. Should
work via Kubernetes as well (restarted containers retain the external
hostname)

  - A way to isolate different "leader sessions" against each other. Flink
currently uses ZooKeeper to also attach a "leader session ID" to leader
election. This ID is a fencing token that prevents processes from talking to
each other when they have different views on who is the leader, or on
whether the leader lost and re-gained leadership.

  - An atomic marker for what is the latest completed checkpoint.

  - A distributed atomic counter for the checkpoint ID. This is crucial to
ensure correctness of checkpoints in the presence of JobManager failures
and re-elections or split-brain situations.
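
As a rough illustration of the last three points, here is a minimal
in-memory sketch in Java. The class and method names are hypothetical (this
is neither Flink nor etcd API), and a real implementation would have to
perform the fence check and the write as one transaction in the external
store rather than under a local lock.

```java
import java.util.UUID;

/**
 * In-memory stand-in for a coordination service. Hypothetical names,
 * for illustration only; not part of Flink or etcd.
 */
class FencedCheckpointStore {
    private UUID leaderSessionId;                // current fencing token
    private long nextCheckpointId = 1;           // checkpoint ID counter
    private long latestCompletedCheckpoint = -1; // marker, -1 means "none yet"

    /** A newly elected leader gets a fresh session ID; older IDs become stale. */
    synchronized UUID grantLeadership() {
        leaderSessionId = UUID.randomUUID();
        return leaderSessionId;
    }

    /** Rejects callers that still act on behalf of an outdated leader session. */
    private void checkFence(UUID session) {
        if (!session.equals(leaderSessionId)) {
            throw new IllegalStateException("stale leader session: " + session);
        }
    }

    /** Hands out strictly increasing checkpoint IDs, also across leader changes. */
    synchronized long nextCheckpointId(UUID session) {
        checkFence(session);
        return nextCheckpointId++;
    }

    /** Moves the "latest completed checkpoint" marker forward, never backward. */
    synchronized boolean markCompleted(UUID session, long checkpointId) {
        checkFence(session);
        if (checkpointId <= latestCompletedCheckpoint) {
            return false;
        }
        latestCompletedCheckpoint = checkpointId;
        return true;
    }

    synchronized long latestCompletedCheckpoint() {
        return latestCompletedCheckpoint;
    }

    public static void main(String[] args) {
        FencedCheckpointStore store = new FencedCheckpointStore();
        UUID oldLeader = store.grantLeadership();
        long id = store.nextCheckpointId(oldLeader);
        store.markCompleted(oldLeader, id);

        // A restarted JobManager takes over; the old session is now fenced off.
        UUID newLeader = store.grantLeadership();
        try {
            store.nextCheckpointId(oldLeader);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println("next id for new leader: " + store.nextCheckpointId(newLeader));
    }
}
```

The point of the sketch is that a JobManager that still believes it is the
leader cannot obtain a checkpoint ID or move the completed-checkpoint marker
once a newer session exists, and that IDs never repeat across restarts.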

I would assume that etcd can provide all of those services. The best way to
integrate it would probably be to add an implementation of Flink's
"HighAvailabilityServices" based on etcd.

Have a look at this class:
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperHaServices.java

If you want to contribute an extension of Flink using etcd, that would be
awesome.
This should have a FLIP though, and a plan for rigorous unit testing of
that implementation (because its correctness is crucial to Flink's HA
resilience).

Best,
Stephan


On Mon, Aug 21, 2017 at 4:15 PM, Shannon Carey <scarey@expedia.com> wrote:

> Zookeeper should still be necessary even in that case, because it is where
> the JobManager stores information which needs to be recovered after the
> JobManager fails.
>
> We're eyeing https://github.com/coreos/zetcd as a way to run Zookeeper on
> top of Kubernetes' etcd cluster so that we don't have to rely on a separate
> Zookeeper cluster. However, we haven't tried it yet.
>
> -Shannon
>
> From: Hao Sun <hasun@zendesk.com>
> Date: Sunday, August 20, 2017 at 9:04 PM
> To: "user@flink.apache.org" <user@flink.apache.org>
> Subject: Flink HA with Kubernetes, without Zookeeper
>
> Hi, I am new to Flink and trying to bring up a Flink cluster on top of
> Kubernetes.
>
> For HA setup, with kubernetes, I think I just need one job manager and do
> not need Zookeeper? I will store all states to S3 buckets. So in case of
> failure, kubernetes can just bring up a new job manager without losing
> anything?
>
> I want to confirm my assumptions above make sense. Thanks
>

--f403043cfc240b08ac0557460484--