openwhisk-dev mailing list archives

From Ben Browning <>
Subject Re: Enablement of controller clustering
Date Fri, 22 Sep 2017 18:10:28 GMT
In Kubernetes and OpenShift, we'd use StatefulSets to give stable hostnames
for the controllers (or at least controller seed nodes). The IPs may change
when a node dies and gets replaced, but the hostnames would remain stable.

It would be ideal if we didn't need stable hostnames or IPs, but I believe
CouchDB, Zookeeper and Kafka will have to be treated similarly for their
underlying clustering mechanisms to work as expected.
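
To illustrate the stable-hostname idea, here is a minimal sketch of how seed node addresses could be derived from StatefulSet pod names. Kubernetes gives each StatefulSet pod a DNS name of the form `<statefulset>-<ordinal>.<service>.<namespace>.svc.<cluster-domain>`. Everything else is an assumption for illustration: the StatefulSet, service, namespace and actor system names are hypothetical, `cluster.local` is only the default cluster domain, and 2552 is used as the Akka remoting port.

```python
# Sketch: build Akka seed-node addresses from StatefulSet pod hostnames.
# All concrete names used here (controller, openwhisk, controller-actor-system)
# are hypothetical placeholders, not values taken from an OpenWhisk deployment.

def statefulset_hostnames(statefulset, service, namespace, replicas,
                          cluster_domain="cluster.local"):
    """Return the stable per-pod DNS names of a StatefulSet.

    Pods are numbered from 0, and the names survive pod replacement
    even when the pod IP changes.
    """
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


def akka_seed_nodes(actor_system, hostnames, port=2552):
    """Format hostnames as akka.tcp seed-node addresses (assumed port 2552)."""
    return [f"akka.tcp://{actor_system}@{host}:{port}" for host in hostnames]


hosts = statefulset_hostnames("controller", "controller", "openwhisk", replicas=2)
seeds = akka_seed_nodes("controller-actor-system", hosts)
for seed in seeds:
    print(seed)
```

Because the seed list is built from names rather than IPs, it would not need to change when a controller pod is rescheduled.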


On Fri, Sep 22, 2017 at 12:53 PM, Tyson Norris <> wrote:

> Thanks Vadim!
> A couple comments:
> - just to be clear: this is leveraging Akka Clustering (not just Akka
> Remoting)
> - I’m interested to hear if “deployment models where controller
> container’s IP changes upon the restart” is actually an edge case (it is
> not for us)
> - I’m not an Akka or Akka Cluster expert, but we’ve been testing Akka
> clustering (separate from OW) and had problems in these cases due to
> dynamic IPs, where explicit logic was required to down the nodes and
> return to normal operation after a failure (would like to hear from any
> Akka/Cluster experts on this topic!)
> IMHO, this is often NOT an edge case, and as such, until the impl is more
> flexible (to allow how seed nodes are defined and downing is handled), then
> the default should be to NOT enable this.
> For example, in Mesos, we cannot predict the IP address of the
> controller at restart, so this will lead to a list of unreachable nodes
> that is never cleared without manual intervention.
> I mentioned this would be OK (as a first step, to require manual
> intervention), but I think the default should be to disable this clustering
> until it can be handled for various deployment scenarios, and in the
> meantime, if people do want to enable this for the “dynamic IP” scenario,
> there needs to be documentation to indicate exactly what steps need to be
> taken to handle downing, and what the risks are of NOT doing this.
> Of course this could be seen as “just a matter of defaults”, so it’s not
> technically a big difference to enable it by default (vs disabled), but I
> would err on the side that will produce the best results for more operators.
> Thanks
> Tyson
> > On Sep 22, 2017, at 9:00 AM, Vadim Raskin <> wrote:
> >
> > Hi everyone,
> > (sorry if dup, had some issues with mail delivery)
> >
> > just wanted to give a small introduction to a piece of work which is
> > currently ongoing in the field of controller scale out. In order to
> > enable several active controller instances running simultaneously we
> > introduce controller clustering, whose main purpose is to share the
> > controller bookkeeping information, e.g. activations per invoker and
> > activations per namespace. Under the hood we use Akka Remoting, which
> > showed good behaviour with no regression in our test environments. The
> > introduction of this feature alone should not change the external
> > behaviour of controllers unless the routing to more than one controller
> > is explicitly enabled.
> >
> > The next recommended steps after the clustering goes into the master:
> > - keep two controllers deployed as before in an active-passive mode with
> > clustering enabled, let controllers replicate their data meanwhile
> > collecting operational experience.
> > - scale out the number of controller nodes, enable active-active mode in
> > the upfront loadbalancer.
> >
> > A couple of things to keep in mind:
> > * this change comes with a feature toggle, which means you could easily
> > turn off clustering by setting controllerLocalBookkeeping in your
> > deployment. This is more appropriate for the first phase when only one
> > controller is active.
> > * there could be certain edge cases where clustering would require
> > special treatment, namely deployment models where the controller
> > container's IP changes upon restart. Say one controller has failed and
> > rejoined the cluster as a new member: there will be some garbage
> > accumulated in the list of cluster members. It is not harmful per se,
> > e.g. the cluster is still running, however healthy cluster nodes will
> > still be gossiping with a non-existing container. If assigning static IP
> > addresses is not an option, one could avoid this case by using the
> > auto-downing feature in Akka Cluster, which allows a cluster leader to
> > mark the failing node as down and remove it from the cluster. To prevent
> > cluster partitioning due to several leaders this property must be set to
> > a relatively high value. The number is not deterministic and could be
> > defined based on further ops experience.
> >
> > If you have any feedback regarding this change, you could respond in this
> > thread, ping me on slack or comment in this PR:
> >
> > ...openwhisk/pull/2531
> >
> > regards, Vadim Raskin.
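
The auto-downing trade-off discussed in the thread can be sketched as a simplified model. This is not Akka code: it only imitates the rule that a leader marks a member as down once it has been unreachable for longer than a configured timeout (in Akka 2.5 this is, to my knowledge, the `akka.cluster.auto-down-unreachable-after` setting). The `Member` type, the addresses and the 120 second timeout are hypothetical.

```python
# Simplified model of timeout-based auto-downing; not the Akka implementation.
# Addresses, timestamps and the timeout value are made up for illustration.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    address: str
    # Time (in seconds) at which the member became unreachable;
    # None means the member is currently reachable.
    unreachable_since: Optional[float] = None


def members_to_down(members: List[Member], now: float,
                    auto_down_after: float) -> List[str]:
    """Return addresses unreachable for at least auto_down_after seconds."""
    return [
        m.address
        for m in members
        if m.unreachable_since is not None
        and now - m.unreachable_since >= auto_down_after
    ]


members = [
    Member("controller@10.0.0.1"),                           # healthy
    Member("controller@10.0.0.2", unreachable_since=100.0),  # old IP of a restarted container
    Member("controller@10.0.0.3", unreachable_since=390.0),  # possibly a short network blip
]

downed = members_to_down(members, now=400.0, auto_down_after=120.0)
print(downed)
```

With a high timeout only the long-gone address is removed, while the briefly unreachable node is left alone. A low timeout would down both, which is the partitioning risk Vadim describes.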
