openwhisk-dev mailing list archives

From Vadim Raskin <>
Subject Enablement of controller clustering
Date Fri, 22 Sep 2017 15:07:29 GMT
Hi everyone,

just wanted to give a small introduction to a piece of work currently
ongoing in the area of controller scale-out. In order to run several active
controller instances simultaneously, we introduce controller clustering,
whose main purpose is to share the controllers' bookkeeping information,
e.g. activations per invoker and activations per namespace. Under the hood
we use Akka Remoting, which showed good behaviour with no regressions in
our test environments. The introduction of this feature alone should not
change the external behaviour of the controllers unless routing to more
than one controller is explicitly enabled.
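As a rough illustration of what clustering over Akka Remoting involves (this
is not the actual OpenWhisk configuration; the actor system name, hostnames
and ports below are all placeholders), an Akka cluster is typically wired up
with a remoting transport plus a list of seed nodes:

```
# Sketch only: system name "controller-actor-system", hosts and ports
# are made up, not the real deployment values.
akka {
  actor.provider = "akka.cluster.ClusterActorRefProvider"
  remote.netty.tcp {
    hostname = "controller0"
    port = 2551
  }
  cluster.seed-nodes = [
    "akka.tcp://controller-actor-system@controller0:2551",
    "akka.tcp://controller-actor-system@controller1:2551"
  ]
}
```

Each controller joins the cluster through the seed nodes and can then gossip
its bookkeeping state to the other members.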

The next recommended steps after the clustering work lands in master:
- keep two controllers deployed as before in active-passive mode with
clustering enabled, and let the controllers replicate their data while
collecting operational experience.
- scale out the number of controller nodes and enable active-active mode in
the upfront load balancer.

A couple of things to keep in mind:
* this change comes with a feature toggle, which means you can easily turn
clustering off by setting controllerLocalBookkeeping in your deployment.
This is more appropriate for the first phase, when only one controller is
active.
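For illustration, the toggle could surface in a deployment as a simple
variable; the exact wiring below is an assumption (only the name
controllerLocalBookkeeping comes from the change itself, how it reaches the
controller depends on your deployment tooling):

```
# Hypothetical deployment snippet: keep bookkeeping local to the
# single active controller, i.e. clustering disabled.
controllerLocalBookkeeping: true
```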
* there could be certain edge cases where clustering requires special
treatment, namely deployment models where a controller container's IP
address changes upon restart. If one controller has failed and rejoined the
cluster as a new member, some garbage will accumulate in the list of
cluster members. This is not harmful per se, i.e. the cluster keeps
running, but healthy cluster nodes will keep gossiping with a non-existent
container. If assigning static IP addresses is not an option, one could
avoid this case by using the auto-downing feature of Akka Cluster, which
allows the cluster leader to mark a failing node as down and remove it from
the cluster. To prevent cluster partitioning due to several leaders, this
timeout must be set to a relatively high value. The exact number is not
deterministic and could be tuned based on further operational experience.
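For reference, classic Akka Cluster exposes auto-downing as a single
setting; the 60-second value below is only an example, not a
recommendation:

```
# Example only: the leader automatically marks unreachable members as
# down after this timeout. Choose the value conservatively -
# too aggressive a timeout risks split-brain with several leaders.
akka.cluster.auto-down-unreachable-after = 60s
```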

If you have any feedback regarding this change, you could respond in this
thread, ping me on slack or comment in this PR:

regards, Vadim Raskin.
