Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 61183200CEC for ; Mon, 21 Aug 2017 18:31:55 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 5FF3A163970; Mon, 21 Aug 2017 16:31:55 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 7F53416396D for ; Mon, 21 Aug 2017 18:31:54 +0200 (CEST) Received: (qmail 48843 invoked by uid 500); 21 Aug 2017 16:31:52 -0000 Mailing-List: contact user-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list user@flink.apache.org Received: (qmail 48832 invoked by uid 99); 21 Aug 2017 16:31:52 -0000 Received: from mail-relay.apache.org (HELO mail-relay.apache.org) (140.211.11.15) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 21 Aug 2017 16:31:52 +0000 Received: from mail-pg0-f45.google.com (mail-pg0-f45.google.com [74.125.83.45]) by mail-relay.apache.org (ASF Mail Server at mail-relay.apache.org) with ESMTPSA id 7F91F1A00A5 for ; Mon, 21 Aug 2017 16:31:51 +0000 (UTC) Received: by mail-pg0-f45.google.com with SMTP id m133so20679577pga.5 for ; Mon, 21 Aug 2017 09:31:51 -0700 (PDT) X-Gm-Message-State: AHYfb5jPMsjvbtA9Q8i2X1hnoXGgcbAkZeIUjm/mfTi6hbhDX1sSM0Wm fKHganmOONCh+OzV5rPcxTN2hywPog== X-Received: by 10.99.180.8 with SMTP id s8mr17198118pgf.166.1503333110710; Mon, 21 Aug 2017 09:31:50 -0700 (PDT) MIME-Version: 1.0 Received: by 10.100.151.142 with HTTP; Mon, 21 Aug 2017 09:31:35 -0700 (PDT) In-Reply-To: <9BF4FB69-90B5-4951-88F5-259206FC0ED4@expedia.com> References: <9BF4FB69-90B5-4951-88F5-259206FC0ED4@expedia.com> From: Stephan Ewen Date: Mon, 21 Aug 2017 18:31:35 +0200 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: Flink HA with Kubernetes, without Zookeeper To: Shannon Carey Cc: Hao Sun , "user@flink.apache.org" Content-Type: multipart/alternative; boundary="f403043cfc240b08ac0557460484" archived-at: Mon, 21 Aug 2017 16:31:55 -0000 --f403043cfc240b08ac0557460484 Content-Type: text/plain; charset="UTF-8" Hi! That is a very interesting proposition. In cases where you have a single master only, you may bet away with quite good guarantees without ZK. In fact, Flink does not store significant data in ZK at all, it only uses locks and counters. You can have a setup without ZK, provided you have the following: - All processes restart (a lost JobManager restarts eventually). Should be given in Kubernetes. - A way for TaskManagers to discover the restarted JobManager. Should work via Kubernetes as well (restarted containers retain the external hostname) - A way to isolate different "leader sessions" against each other. Flink currently uses ZooKeeper to also attach a "leader session ID" to leader election, which is a fencing token to avoid that processes talk to each other despite having different views on who is the leader, or whether the leaser lost and re-gained leadership. - An atomic marker for what is the latest completed checkpoint. - A distributed atomic counter for the checkpoint ID. This is crucial to ensure correctness of checkpoints in the presence of JobManager failures and re-elections or split-brain situations. I would assume that etcd can provide all of those services. The best way to integrate it would probably be to add an implementation of Flink's "HighAvailabilityServices" based on etcd. Have a look at this class: https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperHaServices.java If you want to contribute an extension of Flink using etcd, that would be awesome. This should have a FLIP though, and a plan on how to set up rigorous unit testing of that implementation (because its correctness is very crucial to Flink's HA resilience). Best, Stephan On Mon, Aug 21, 2017 at 4:15 PM, Shannon Carey wrote: > Zookeeper should still be necessary even in that case, because it is where > the JobManager stores information which needs to be recovered after the > JobManager fails. > > We're eyeing https://github.com/coreos/zetcd as a way to run Zookeeper on > top of Kubernetes' etcd cluster so that we don't have to rely on a separate > Zookeeper cluster. However, we haven't tried it yet. > > -Shannon > > From: Hao Sun > Date: Sunday, August 20, 2017 at 9:04 PM > To: "user@flink.apache.org" > Subject: Flink HA with Kubernetes, without Zookeeper > > Hi, I am new to Flink and trying to bring up a Flink cluster on top of > Kubernetes. > > For HA setup, with kubernetes, I think I just need one job manager and do > not need Zookeeper? I will store all states to S3 buckets. So in case of > failure, kubernetes can just bring up a new job manager without losing > anything? > > I want to confirm my assumptions above make sense. Thanks > --f403043cfc240b08ac0557460484 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi!

That is a very interesting proposit= ion. In cases where you have a single master only, you may bet away with qu= ite good guarantees without ZK. In fact, Flink does not store significant d= ata in ZK at all, it only uses locks and counters.

You can have a setup without ZK, provided you have the following:

=C2=A0 - All processes restart (a lost JobManager restarts= eventually). Should be given in Kubernetes.

=C2= =A0 - A way for TaskManagers to discover the restarted JobManager. Should w= ork via Kubernetes as well (restarted containers retain the external hostna= me)

=C2=A0 - A way to isolate different "lead= er sessions" against each other. Flink currently uses ZooKeeper to als= o attach a "leader session ID" to leader election, which is a fen= cing token to avoid that processes talk to each other despite having differ= ent views on who is the leader, or whether the leaser lost and re-gained le= adership.

=C2=A0 - An atomic marker for what is th= e latest completed checkpoint.

=C2=A0 - A distribu= ted atomic counter for the checkpoint ID. This is crucial to ensure correct= ness of checkpoints in the presence of JobManager failures and re-elections= or split-brain situations.

I would assume that et= cd can provide all of those services. The best way to integrate it would pr= obably be to add an implementation of Flink's "HighAvailabilitySer= vices" based on etcd.


If you want to contribute a= n extension of Flink using etcd, that would be awesome.
This shou= ld have a FLIP though, and a plan on how to set up rigorous unit testing of= that implementation (because its correctness is very crucial to Flink'= s HA resilience).=C2=A0

Best,
Stephan


On Mon, Aug 21, 2017 at 4:15 PM, Shannon Carey <scarey@expedia.com&g= t; wrote:
Zookeeper should still be necessary even in that case, because it is w= here the JobManager stores information which needs to be recovered after th= e JobManager fails.

We're eyeing=C2=A0https://github.com/coreos/zetcd=C2=A0as a way to run= Zookeeper on top of Kubernetes' etcd cluster so that we don't have= to rely on a separate Zookeeper cluster. However, we haven't tried it = yet.

-Shannon

From: Hao Sun <hasun@zendesk.com>
Date: Sunday, August 20, 2017 at 9:= 04 PM
To: "user@flink.apache.org" <user@flink.apache.org= >
Subject: Flink HA with Kubernetes, = without Zookeeper

Hi, I am new to Flink and trying to bring up a Flink clust= er on top of Kubernetes.

For HA setup, with kubernetes, I think I just need one job manager and= do not need Zookeeper? I will store all states to S3 buckets. So in case o= f failure, kubernetes can just bring up a new job manager without losing an= ything?

I want to confirm my assumptions above make sense. Thanks

--f403043cfc240b08ac0557460484--