openwhisk-dev mailing list archives

From "Markus Thoemmes" <markus.thoem...@de.ibm.com>
Subject System overflow based on invoker status
Date Thu, 12 Jul 2018 15:38:07 GMT
Hi OpenWhiskers,

Today, we have an arbitrary system-wide limit of maximum concurrent connections in the system.
In general that is fine, but it doesn't have a direct correlation to what's actually happening
in the system.

I propose to add a new state to each monitored invoker: Overloaded. An invoker will go into the
overloaded state if its active-acks start to time out. Eventually, if the system is really overloaded,
all invokers will be in the overloaded state, which will cause the loadbalancer to return a failure.
This failure results in a `503 - System overloaded` message back to the user. The system-wide
concurrency limit would be removed.
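
To make the proposed behaviour concrete, here is a minimal sketch in Python. It is not the actual loadbalancer code: the names `Invoker`, `InvokerState`, `on_active_ack_timeout`, `schedule` and `SystemOverloaded` are hypothetical, only two states are modelled, and picking the first healthy invoker stands in for the real scheduling logic.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class InvokerState(Enum):
    # Only the two states needed for the illustration; the real system has more.
    HEALTHY = "healthy"
    OVERLOADED = "overloaded"


@dataclass
class Invoker:
    name: str
    state: InvokerState = InvokerState.HEALTHY


class SystemOverloaded(Exception):
    """Hypothetical error that an API layer would map to `503 - System overloaded`."""
    status_code = 503


def on_active_ack_timeout(invoker: Invoker) -> None:
    # Assumption: a timed-out active-ack moves the invoker into the overloaded state.
    invoker.state = InvokerState.OVERLOADED


def schedule(invokers: List[Invoker]) -> Invoker:
    # Only healthy invokers are candidates for new activations.
    usable = [i for i in invokers if i.state is InvokerState.HEALTHY]
    if not usable:
        # Every invoker is overloaded, so the loadbalancer returns a failure.
        raise SystemOverloaded("503 - System overloaded")
    # Placeholder choice; the real loadbalancer uses its own scheduling algorithm.
    return usable[0]
```

In this sketch the upper bound on the system is no longer a configured number. It emerges once every invoker has been marked overloaded.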

This organic system limit can be tuned through a timeout factor, which is made adjustable in
https://github.com/apache/incubator-openwhisk/pull/3767. The default timeout is 2 * maximumActionRuntime
+ 1 minute. For the vast majority of use-cases, hitting it means that there are 3x more activations
in the system than it can handle, or put differently: activations need to wait for minutes
until they are executed. I think it's safe to say that the system is overloaded if this is
true for all invokers in your system.
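
As a worked example of the default formula above, the following Python sketch computes the active-ack timeout. The function name `active_ack_timeout` and its parameters are hypothetical, and it assumes the adjustable factor is the multiplier in front of the maximum action runtime. The 5-minute maximum runtime is chosen only for illustration.

```python
from datetime import timedelta


def active_ack_timeout(max_action_runtime: timedelta,
                       factor: int = 2,
                       addon: timedelta = timedelta(minutes=1)) -> timedelta:
    # Default from the text: 2 * maximumActionRuntime + 1 minute.
    # Assumption: the timeout factor replaces the constant 2.
    return factor * max_action_runtime + addon


# Illustrative value only: a maximum action runtime of 5 minutes.
timeout = active_ack_timeout(timedelta(minutes=5))
print(timeout)  # 0:11:00
```

Lowering the factor makes invokers go into the overloaded state sooner, while raising it lets more work queue up before the system starts rejecting requests.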

Note: We used to handle active-ack timeouts as system errors and take invokers into unhealthy
state. With the old non-consistent loadbalancer, that caused a lot of "flappy" states
in the invokers. With the new consistent implementation, active-ack timeouts should only occur
in problematic situations (either the invoker itself is having problems, or it is queueing). Taking
the invoker out of the loadbalancer if there are active-acks missing on that invoker is generally
helpful, because missing active-acks also mean inconsistent state in the loadbalancer (it
updates its state as if the active-ack arrived correctly).

A first stab at the implementation can be found here: https://github.com/apache/incubator-openwhisk/pull/3875.

Any concerns with that approach to place an upper bound on the system?

Cheers,
Markus

