openwhisk-dev mailing list archives

From Vadim Raskin <raskinva...@gmail.com>
Subject Re: Enablement of controller clustering
Date Fri, 22 Sep 2017 17:42:51 GMT
Thanks for the feedback.

I'm ok with keeping local bookkeeping as a default for a while.

Regarding the "edge case": what I meant is that, based on the tests I have
run, adding the same node to the cluster under a different IP during the
outage is not an issue. I did NOT mean that deployment models without
static IPs are the "edge case".

Regards, Vadim.

On Fri, 22 Sep 2017 at 18:53 Tyson Norris <tnorris@adobe.com.invalid>
wrote:

> Thanks Vadim!
>
> A couple comments:
> - just to be clear: this is leveraging Akka Clustering (not just Akka
> Remoting)
> - I’m interested to hear if "deployment models where controller
> container’s IP changes upon the restart” is actually an edge case (it is
> not for us)
> - I’m not an Akka or Akka Cluster expert, but we’ve been testing Akka
> clustering (separate from OW) and had problems in these cases due to
> dynamic IPs, where it required logic to explicitly down the nodes to
> return to normal operation after a failure (would like to hear from any
> Akka/Cluster experts on this topic!)
>
> IMHO, this is often NOT an edge case, and as such, until the impl is more
> flexible (to allow configuring how seed nodes are defined and how downing
> is handled), the default should be to NOT enable this.
>
> For example, in Mesos, we cannot predict the IP address of the
> controller at restart, so this will lead to a list of unreachable nodes
> that is never cleared without manual intervention.
>
> I mentioned this would be OK (as a first step, to require manual
> intervention), but I think the default should be to disable this clustering
> until it can be handled for various deployment scenarios. In the
> meantime, if people do want to enable this for the “dynamic IP” scenario,
> there needs to be documentation indicating exactly what steps need to be
> taken to handle downing, and what the risks are of NOT doing this.
>
> Of course this could be seen as "just a matter of defaults”, so it's not
> technically a big difference to enable it by default (vs. disabled), but I
> would err on the side that will produce the best results for more operators.
>
> WDYT?
>
> Thanks
> Tyson
>
> > On Sep 22, 2017, at 9:00 AM, Vadim Raskin <raskinvadim@gmail.com> wrote:
> >
> > Hi everyone,
> > (sorry if dup, had some issues with mail delivery)
> >
> > just wanted to give a small introduction to a piece of work which is
> > currently ongoing in the field of controller scale out. In order to enable
> > several active controller instances running simultaneously we introduce
> > controller clustering, whose main purpose is to share the controller
> > bookkeeping information, e.g. activations per invoker and activations per
> > namespace. Under the hood we use Akka Remoting, which showed good behaviour
> > with no regression in our test environments. The introduction of this
> > feature alone should not change the external behaviour of controllers
> > unless the routing to more than one controller is explicitly enabled.
> >
> > The next recommended steps after the clustering goes into master:
> > - keep two controllers deployed as before in an active-passive mode with
> > clustering enabled, and let the controllers replicate their data while we
> > collect operational experience.
> > - scale out the number of controller nodes and enable active-active mode
> > in the upfront load balancer.
> >
> > A couple of things to keep in mind:
> > * this change comes with a feature toggle, which means you can easily
> > turn off clustering by setting controllerLocalBookkeeping in your
> > deployment. This is more appropriate for the first phase, when only one
> > controller is active.
> > * there could be certain edge cases where clustering would require
> > special treatment, namely deployment models where the controller
> > container's IP changes upon restart. Say one controller has failed and
> > rejoined the cluster as a new member: there will be some garbage
> > accumulated in the list of cluster members. It is not harmful per se, i.e.
> > the cluster is still running, but healthy cluster nodes will still be
> > gossiping with a non-existing container. If assigning static IP addresses
> > is not an option, one could avoid this case by using the auto-downing
> > feature in Akka Cluster, which allows the cluster leader to mark the
> > failing node as down and remove it from the cluster. To prevent cluster
> > partitioning due to several leaders, this property must be set to a
> > relatively high value. There is no single correct value; it could be
> > tuned based on further ops experience.
> >
> > If you have any feedback regarding this change, you could respond in this
> > thread, ping me on slack or comment in this PR:
> >
> > https://github.com/apache/incubator-openwhisk/pull/2531
> >
> > regards, Vadim Raskin.
>
>
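
To make the "garbage in the member list" scenario from the quoted announcement concrete, here is a toy model in Python. It is only a sketch and is not the Akka Cluster API: the Membership class, its auto_down_after parameter (in seconds) and the IP addresses are all made up for illustration. In Akka releases of that period the corresponding setting was, as far as I recall, akka.cluster.auto-down-unreachable-after, but please check the Akka documentation before relying on that name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Membership:
    """Toy cluster member list; hypothetical, NOT the Akka Cluster API."""

    # Seconds a member may stay unreachable before it is removed.
    # None disables auto-downing (hypothetical parameter).
    auto_down_after: Optional[float] = None
    # address -> time the member became unreachable (None means reachable)
    members: Dict[str, Optional[float]] = field(default_factory=dict)

    def join(self, address: str) -> None:
        self.members[address] = None

    def mark_unreachable(self, address: str, now: float) -> None:
        # Only record the first time a known, reachable member goes missing.
        if address in self.members and self.members[address] is None:
            self.members[address] = now

    def tick(self, now: float) -> None:
        """Remove members that have been unreachable for at least the timeout."""
        if self.auto_down_after is None:
            return
        for address, since in list(self.members.items()):
            if since is not None and now - since >= self.auto_down_after:
                del self.members[address]

    def unreachable(self) -> List[str]:
        return sorted(a for a, since in self.members.items() if since is not None)


def simulate(auto_down_after: Optional[float]) -> List[str]:
    """A controller crashes at t=0 and rejoins under a new IP; inspect at t=300."""
    cluster = Membership(auto_down_after=auto_down_after)
    cluster.join("10.0.0.1")
    cluster.join("10.0.0.2")
    # The old address never comes back, the restarted container gets a new one.
    cluster.mark_unreachable("10.0.0.2", now=0.0)
    cluster.join("10.0.0.3")
    cluster.tick(now=300.0)
    return cluster.unreachable()


if __name__ == "__main__":
    print(simulate(None))   # stale entry stays until someone downs it manually
    print(simulate(120.0))  # stale entry is dropped once the timeout has passed
```

Without a timeout the old address stays in the unreachable list indefinitely, which is the manual-intervention case Tyson describes for dynamic IPs. With a timeout it is cleared automatically, at the cost of the partitioning risk mentioned above if the value is chosen too low.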
