openwhisk-dev mailing list archives

From Tyson Norris <>
Subject Re: Enablement of controller clustering
Date Fri, 22 Sep 2017 16:53:38 GMT
Thanks Vadim!

A couple comments:
- just to be clear: this is leveraging Akka Clustering (not just Akka Remoting)
- I’m interested to hear whether “deployment models where controller container’s IP changes
upon the restart” is actually an edge case (it is not for us)
- I’m not an Akka or Akka Cluster expert, but we’ve been testing Akka clustering (separate
from OW) and had problems in these cases due to dynamic IPs: it required logic
to explicitly down the nodes to return to normal operation after a failure (would like to
hear from any Akka/Cluster experts on this topic!)

IMHO, this is often NOT an edge case, and as such, until the impl is more flexible (to allow
configuring how seed nodes are defined and how downing is handled), the default should be to
NOT enable clustering.

For example, in Mesos, we cannot predict the IP address of the controller at restart, so
this will lead to a list of unreachable nodes that is never cleared without manual intervention.

I mentioned this would be OK (as a first step, to require manual intervention), but I think
the default should be to disable this clustering until it can be handled for various deployment
scenarios, and in the meantime, if people do want to enable this for the “dynamic IP”
scenario, there needs to be documentation to indicate exactly what steps need to be taken to
handle downing, and what the risks are of NOT doing this. 
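
To make the downing step concrete, here is a minimal sketch in Python of the bookkeeping involved. It is a toy model, not Akka code: ClusterView, addresses_to_down, the IP addresses, and the 60-second grace period are all made up for illustration. In a real deployment the addresses it returns would be handed to Akka Cluster's own downing operation, and the grace period would have to be tuned from ops experience.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterView:
    """Toy stand-in for a cluster membership view (hypothetical, not the Akka API)."""
    members: set = field(default_factory=set)
    unreachable_since: dict = field(default_factory=dict)  # address -> time in seconds

    def join(self, address):
        self.members.add(address)

    def mark_unreachable(self, address, now):
        # Record only the first time contact with this member was lost.
        if address in self.members:
            self.unreachable_since.setdefault(address, now)

    def down(self, address):
        # Explicit downing: drop the member and its unreachable record.
        self.members.discard(address)
        self.unreachable_since.pop(address, None)


def addresses_to_down(view, now, grace_seconds):
    """Return addresses that have been unreachable for at least grace_seconds."""
    return sorted(
        addr for addr, since in view.unreachable_since.items()
        if now - since >= grace_seconds
    )


view = ClusterView()
view.join("10.0.0.1:2552")
view.join("10.0.0.2:2552")

# The controller on 10.0.0.2 dies at t=100 and restarts on a new, unpredictable IP.
view.mark_unreachable("10.0.0.2:2552", now=100)
view.join("10.0.0.7:2552")

# Without downing, the old address stays in the unreachable list.
stale_before = sorted(view.unreachable_since)

# With a 60 s grace period nothing is downed at t=130, but the old address is at t=160.
early = addresses_to_down(view, now=130, grace_seconds=60)
late = addresses_to_down(view, now=160, grace_seconds=60)
for addr in late:
    view.down(addr)
```

The grace period is the trade-off: too short and a slow but healthy controller gets downed (risking a partition), too long and the stale entry lingers.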

Of course this could be seen as “just a matter of defaults”, so it’s not technically a big
difference to enable it by default (vs disabled), but I would err on the side that will produce
the best results for more operators. 



> On Sep 22, 2017, at 9:00 AM, Vadim Raskin <> wrote:
> Hi everyone,
> (sorry if dup, had some issues with mail delivery)
> just wanted to give a small introduction to a piece of work which is
> currently ongoing in the field of controller scale out. In order to enable
> several active controller instances running simultaneously we introduce
> controller clustering, whose main purpose is to share the controller
> bookkeeping information, e.g. activations per invoker and activations per
> namespace. Under the hood we use Akka Remoting, which showed good behaviour
> with no regression in our test environments. The introduction of this
> feature alone should not change the external behaviour of controllers
> unless the routing to more than one controller is explicitly enabled.
> The next recommended steps after the clustering goes into the master:
> - keep two controllers deployed as before in an active-passive mode with
> clustering enabled, let controllers replicate their data meanwhile
> collecting operational experience.
> - scale out the number of controller nodes, enable active-active mode in
> the upfront loadbalancer.
> A couple of things to keep in mind:
> * this change comes with a feature toggle, which means you could easily
> turn off clustering by setting a controllerLocalBookkeeping in your
> deployment. This is more appropriate for the first phase when only one
> controller is active.
> * there could be certain edge cases where clustering would require a
> special treatment in case of deployment models where controller container's
> IP changes upon the restart. Say if one controller has failed and joined
> the cluster as a new member, there will be some garbage accumulated in the
> list of cluster members. It is not harmful per se, e.g. the cluster is
> still running, however healthy cluster nodes will be still gossiping with a
> non-existing container. If assigning static IP addresses is not an option,
> in order to avoid this case one could use auto-downing feature in akka
> cluster, which allows a cluster leader to mark the failing node as down
> and remove it from the cluster. To prevent cluster partitioning due to
> several leaders this property must be set to a relatively high value. The
> number is not deterministic and could be defined based on the further ops
> experience.
> If you have any feedback regarding this change, you could respond in this
> thread, ping me on slack or comment in this PR:
> regards, Vadim Raskin.
