openwhisk-dev mailing list archives

From "Markus Thoemmes" <markus.thoem...@de.ibm.com>
Subject Introducing sharding as an alternative for state sharing
Date Thu, 01 Feb 2018 14:25:03 GMT
Hi folks,

We (Christian Bickel and I) just opened a pull request for comments that introduces a notion
of sharding instead of sharing state between our controllers (loadbalancers). It also addresses
a few deficiencies of the old loadbalancer to remove any remaining bottlenecks there and make
it as fast as possible.

Commit message for posterity:

The current ContainerPoolBalancer suffers from a couple of problems and bottlenecks:

1. Inconsistent state: The data structures keeping the state for that loadbalancer are not
handled in a thread-safe way, meaning activations can queue up on some invokers even though
there is free capacity on other invokers.
2. Asynchronously shared state: Sharing the state is needed for a highly available deployment
of multiple controllers and for horizontal scaling across them. That state sharing makes point
1 even worse and is nowhere near fast enough to schedule quick bursts efficiently.
3. Bottlenecks: Getting the state from the outside (for example for the ActivationThrottle) is a
very costly operation (at least in the shared-state case) and actually bottlenecks the whole
invocation path. Getting the current state of the invokers is a second bottleneck, since one
request is made to the corresponding actor for every invocation.

The new implementation aims to solve the problems mentioned above as follows:

1. All state is local: There is no shared state. Resources are managed through horizontal
sharding. Horizontal sharding means that the invokers' slots are evenly divided between the
existing loadbalancers. If we deploy 2 loadbalancers and each invoker has 16 slots, each
loadbalancer will have access to 8 slots on each invoker (see the sketch after this list).
2. Slots are given away atomically: When scheduling an activation, a slot is immediately
assigned to that activation (implemented through Semaphores). That means that even with
concurrent scheduling, an invoker will not be overloaded as long as it has capacity left.
3. Asynchronous updates of slow data: Slowly changing data, like a change in an invoker's
state, is handled asynchronously and folded into a local copy of the state. Querying that
state is as cheap as it can be.
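
To make points 1 and 2 a bit more concrete, here is a minimal sketch of the idea. This is not
the actual PR code; the class name, the fixed controller count and the naive invoker pick are
assumptions purely for illustration. Each controller sizes one Semaphore per invoker to its own
share of that invoker's slots and claims a slot atomically before scheduling an activation:

    import java.util.concurrent.Semaphore

    // Hypothetical sketch of horizontal sharding with atomic slot hand-out.
    class ShardedInvokerSlots(slotsPerInvoker: Int, controllerCount: Int, invokerCount: Int) {
      // Horizontal sharding: this controller only owns an equal share of each
      // invoker's slots, e.g. 16 slots and 2 controllers -> 8 slots per invoker.
      private val slotsPerShard = slotsPerInvoker / controllerCount

      // One Semaphore per invoker, sized to this controller's shard only.
      private val shards: IndexedSeq[Semaphore] =
        IndexedSeq.fill(invokerCount)(new Semaphore(slotsPerShard))

      /** Atomically claim a slot on the given invoker; false if this shard is exhausted. */
      def tryAcquire(invoker: Int): Boolean = shards(invoker).tryAcquire()

      /** Release the slot once the activation has completed. */
      def release(invoker: Int): Unit = shards(invoker).release()
    }

    // Usage inside a controller: claim a slot on the first invoker that still has
    // capacity in this controller's shard.
    val slots = new ShardedInvokerSlots(slotsPerInvoker = 16, controllerCount = 2, invokerCount = 4)
    val chosen: Option[Int] = (0 until 4).find(slots.tryAcquire)

Because each controller only ever hands out slots from its own shard, concurrent controllers
cannot overcommit an invoker even without any coordination between them; the slowly changing
invoker state from point 3 only decides which invokers are considered at all.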

A few words on the implementation details:

We chose horizontal sharding (evenly dividing the capacity of each invoker) over vertical
sharding (evenly dividing the invokers themselves) mainly to be able to stage these changes.
Once we divide vertically, we'll need a good loadbalancing strategy in front of the controllers
themselves to keep unnecessary cold-starts to a minimum and maximize container reuse. By
dividing horizontally, we maintain the same reuse policies as today and can even keep the
same loadbalancing strategies intact. Horizontal sharding of course only scales so far (maybe
to 4 controllers, assuming 16 slots on each invoker, which leaves each controller with 4 slots
per invoker), but it buys us time to figure out good strategies for vertical sharding and to
learn along the way. For vertical sharding to work, it will also be crucial to have the
centralized overflow queue, so work can be offloaded between shards through work stealing.
All in all: vertical sharding is a much bigger change than horizontal sharding.
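
For illustration only, a small sketch of the difference between the two partitioning schemes
(the Invoker case class and the round-robin split in the vertical case are made up for this
example; the PR only implements the horizontal variant):

    // Hypothetical example: 4 invokers with 16 slots each, 2 controllers.
    case class Invoker(id: Int, slots: Int)
    val invokers = (0 until 4).map(Invoker(_, 16))
    val controllers = 2

    // Horizontal sharding: every controller sees every invoker, but only a
    // fraction of its slots (16 / 2 = 8 per invoker here).
    val horizontalShard: Seq[Invoker] =
      invokers.map(i => i.copy(slots = i.slots / controllers))

    // Vertical sharding: the invokers themselves are divided, so controller 0
    // would own invokers 0 and 2 with all 16 slots each.
    def verticalShard(controller: Int): Seq[Invoker] =
      invokers.filter(_.id % controllers == controller)

The appeal of the horizontal variant, as noted above, is that every controller still schedules
onto every invoker, so today's reuse and loadbalancing policies carry over unchanged.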

We first tried to implement everything in a single actor, but that again seemed to impose a
bottleneck. Note that this code is on a very hot path and needs to be very tight, which might
not match the actor model too well.

To sum it all up: this keeps all proposed changes intact (most notably Tyson's parallel
executions and overflow queue).

A note on the gains made by this: our non-blocking invoke performance is now quite close to
the raw Kafka produce performance we get in the system (not that that is good in itself, but
it is the next theoretical bottleneck). Before these changes, non-blocking invokes were roughly
capped at 4k requests per second on a sufficiently powerful machine. Blocking invoke
performance roughly doubled.

Any comments, thoughts?

Cheers,
Markus

