openwhisk-dev mailing list archives

From Rodric Rabbah <>
Subject Re: System overflow based on invoker status
Date Tue, 17 Jul 2018 17:05:39 GMT
Hi Markus

Per our discussion on slack, I’m documenting below the concerns we discussed. (And thanks
for fixing my math bug.)

The approach of being more introspective to detect overload is a good improvement over the
ad hoc value set today. Thanks for bringing this up. I do have a concern, though, about tying
system overload (and queuing depth) to active acks, which also affect other components.
Please allow me to explain so that we can see if there's a real concern. Thanks for
reviewing this on Slack and for the discussion around it.

** First, sequences (and conductor actions) wait for each activation to complete before
processing the next action. If you're willing to tolerate a longer active-ack timeout, should
the composition wait just as long? Second, we had issues in the past where improper accounting
of a user's outstanding requests, due to delayed or missing active acks, would penalize
and throttle a subject. Lastly, there is a backup mechanism for detecting completed activations
from the activation store; longer active-ack timeouts mean longer polling and more load on the
database.

** The need for this mechanism suggests that the health protocol, which uses pings alone, is
not sufficient and needs this secondary mechanism. The active acks, as noted above, now have a
few intertwined dependencies.

** Since the definition of overloaded here is tied to active acks timing out, I also think
we would be changing the behavior of the system overall: requests that would have been accepted
and queued in the past would be rejected much more eagerly. This also relates the issue
to re-architecting the system with the overflow queue, as previously discussed on the dev list,
because there are requests for which _waiting_ is ok (e.g., batch and triggers) vs. blocking
requests (web actions) where waiting too long is not acceptable.

** Of course this is related to the capacity in the system and assumes static capacity
(shameless plug for The Serverless Contract).
If you detect overload and add capacity, it's a different discussion (not rejecting requests
subject to a max elasticity vs. rejecting requests for a given capacity).

Say an active ack for an activation _i_ times out after time
    T(i) = L(i) x C + epsilon
where L(i) is the action's max duration for activation i, and C is the constant fudge factor
(which is indirectly the wait time in the queue for this activation).

Let an invoker have S slots, all of which are occupied, with max duration L(j) >= L(i) for
every activation _j_ in the container pool. That is, all the slots in the assigned pool are
busy and the hold time will be at least L(i) for each slot.

Since an active ack's timeout T(i) is oblivious to the requests ahead of it in the queue,
it would take C x S requests ahead of activation i in the queue for the request to time out.
(I think wlog we can ignore the epsilon, since C x S + 1 for example would cover it, and we can
ignore the actual execution time of activation i.)
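
As a sanity check on the C x S claim, here is a minimal Python sketch. It rests on simplifying assumptions that are mine rather than anything in the implementation: all S slots start a fresh activation at time 0, every activation holds its slot for exactly L seconds, the queue drains FIFO, C is an integer, and the execution time of activation i itself is ignored. The function names are hypothetical.

```python
def ack_timeout(max_duration, c, epsilon=0.0):
    """T(i) = L(i) x C + epsilon, as defined above."""
    return max_duration * c + epsilon


def start_time(ahead, slots, hold):
    """Time at which a queued request is handed a slot.

    Assumes all `slots` are busy from time 0, each activation holds its slot
    for exactly `hold` seconds, and `ahead` requests are queued in front (FIFO).
    """
    return (ahead // slots + 1) * hold


def min_queue_depth_for_timeout(slots, hold, c):
    """Smallest number of requests ahead that makes the request time out."""
    ahead = 0
    while start_time(ahead, slots, hold) <= ack_timeout(hold, c):
        ahead += 1
    return ahead
```

Under these assumptions, `min_queue_depth_for_timeout(16, 60, 2)` yields 32, i.e. C x S requests ahead of activation i.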

The system would be overloaded once (K x S) + (K x (S x C + 1)) activations have been accepted,

where K is the number of invokers,
and S is the number of slots per invoker,
and C is the queuing factor for requests in the queue ( >= 0),
and all actions have the same expected hold time.

So some numbers: 
K = 1 invoker, S = 16 slots per invoker, and C = 2: the system will overload (and reject
requests) after 49 activations are accepted.
K = 10: 490, and
K = 100: 4900.
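
The figures above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in this message; `overload_threshold` is a hypothetical helper name, not something in the codebase.

```python
def overload_threshold(invokers, slots, c):
    """Activations accepted before the system reports overload.

    K x S activations occupying slots, plus K x (S x C + 1) waiting in queues.
    """
    running = invokers * slots
    queued = invokers * (slots * c + 1)
    return running + queued


# K = 1, 10, 100 with S = 16 and C = 2
for k in (1, 10, 100):
    print(k, overload_threshold(k, slots=16, c=2))
```

Raising C in this helper shows how quickly the accepted backlog grows with the queuing factor.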

If we increase C to tolerate more queuing, then we indirectly also affect the execution of
compositions and quotas. I think we should, as you suggest, have a mechanism for detecting
overload correctly, so this is a better approach given where we are.

We should caution that if a deployment has a disproportionately high overload setting in their
configuration they will need to be aware of this change. 


> On Thu, Jul 12, 2018 at 11:38 AM, Markus Thoemmes <>
> Hi OpenWhiskers,
> Today, we have an arbitrary system-wide limit of maximum concurrent connections in the
> system. In general that is fine, but it doesn't have a direct correlation to what's actually
> happening in the system.
> I propose to add a new state to each monitored invoker: Overloaded. An invoker will go into
> overloaded state if active-acks are starting to time out. Eventually, if the system is really
> overloaded, all invokers will be in overloaded state, which will cause the loadbalancer to
> return a failure. This failure now results in a `503 - System overloaded` message back to
> the user. The system-wide concurrency limit would be removed.
> The organic system limit will be adjustable via a timeout factor. The default is
> 2 * maximumActionRuntime + 1 minute. For the vast majority of use-cases, this means that
> there are 3x more activations in the system than it can handle or, put differently,
> activations need to wait for minutes until they are executed. I think it's safe to say that
> the system is overloaded if this is true for all invokers in your system.
> Note: We used to handle active-ack timeouts as system errors and take invokers into unhealthy
> state. With the old non-consistent loadbalancer, that caused a lot of "flappy" states
> in the invokers. With the new consistent implementation, active-ack timeouts should only occur
> in problematic situations (either the invoker itself is having problems, or queueing). Taking
> the invoker out of the loadbalancer if there are active-acks missing on that invoker is generally
> helpful, because missing active-acks also mean inconsistent state in the loadbalancer (it
> updates its state as if the active-ack arrived correctly).
> A first stab at the implementation can be found here:
> Any concerns with that approach to place an upper bound on the system?
> Cheers,
> Markus
