openwhisk-dev mailing list archives

From "Sven Lange-Last" <>
Subject OpenWhisk invoker overloads - "Rescheduling Run message"
Date Thu, 04 Jul 2019 08:36:04 GMT
When running Apache OpenWhisk with a high load of heterogeneous action 
invocations - i.e., high variance in memory and time limits and a high 
number of different namespaces and actions - single invokers are 
occasionally overloaded. With this post, I want to share my observations 
and conclusions with the community. As a follow-up, I'm planning a series 
of pull requests that are meant to provide more insight when running into 
these invoker overloads.

Please share your observations and conclusions as well. I'm looking 
forward to your feedback.

Invoker overloads can be inferred from occurrences of the (in)famous 
"Rescheduling Run message" log entry on invokers [1]:

"Rescheduling Run message, too many message in the pool, freePoolSize: 0 
containers and 0 MB, busyPoolSize: 2 containers and 512 MB, 
maxContainersMemory 512 MB, userNamespace: namespace, action: 
ExecutableWhiskAction/namespace/package/action@0.0.1, needed memory: 256 
MB, waiting messages: 0"

Excess activations on an overloaded invoker need to wait until one of the 
currently running actions completes. This can take up to the maximum 
action time limit, potentially causing unacceptable wait times for these 
excess activations.

In large OpenWhisk installations, we at times see thousands of these log 
entries within a few days. Do other OpenWhisk adopters share these 
observations?

I'm aware of two conditions that cause overloaded invokers by design:

1. A controller schedules an activation - but no acknowledgement is 
received from the invoker within the expected time because the invoker 
takes too long to complete the activation. On timer expiration, a forced 
acknowledgement removes that activation from the load balancer's memory 
book-keeping. With the released capacity, the load balancer schedules a 
new activation to the invoker - which may still be running the action 
that previously timed out in the controller. (A sketch after this list 
illustrates this race.)

2. A controller cannot identify an invoker that has enough free capacity 
to schedule an activation - in particular, this can happen if the 
activation's memory limit is equal to the controller's shard memory size 
on the invoker. If there is at least one usable invoker, the load balancer 
will select a random invoker to schedule the activation. This situation is 
called overload in the load balancer code and will drive the book-keeping 
semaphore to a negative value. Obviously, the selected invoker cannot 
immediately process the scheduled activation. (A second sketch after the 
list illustrates this fallback.)
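
To make condition 1 concrete, here is a minimal, self-contained Scala 
sketch of the book-keeping race. All names are hypothetical - this is not 
the actual load balancer code - but it shows how capacity is 
force-released on timer expiration while the invoker may still be running 
the action:

import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.concurrent.TrieMap

object ForcedAckSketch {
  // activation id -> reserved memory in MB (stand-in for the load
  // balancer's memory book-keeping)
  private val inFlight = TrieMap.empty[String, Long]
  private val timer = Executors.newSingleThreadScheduledExecutor()

  def schedule(activationId: String, memoryMB: Long, ackTimeoutSec: Long): Unit = {
    inFlight.put(activationId, memoryMB)
    // If no completion acknowledgement arrives in time, force-release
    // the reserved capacity.
    timer.schedule(new Runnable { def run(): Unit = forceAck(activationId) },
      ackTimeoutSec, TimeUnit.SECONDS)
  }

  def completionAck(activationId: String): Unit =
    inFlight.remove(activationId).foreach(mb => println(s"released $mb MB (regular ack)"))

  def forceAck(activationId: String): Unit =
    inFlight.remove(activationId).foreach { mb =>
      // Capacity is released here, but the invoker may still be running
      // the action - the next activation scheduled with this capacity can
      // end up waiting on the invoker ("Rescheduling Run message").
      println(s"released $mb MB (forced ack) - invoker may still be busy")
    }
}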
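
And a similarly hedged sketch of condition 2: when no usable invoker has 
enough free capacity, a random usable invoker is chosen anyway. The names 
and signature are illustrative - the actual logic lives in the schedule() 
method of the ShardingContainerPoolBalancer:

import scala.util.Random

final case class InvokerState(id: Int, freeMemoryMB: Long)

// Returns the chosen invoker plus a flag telling whether this was an
// "overload" choice that over-commits the invoker.
def schedule(neededMB: Long, usable: IndexedSeq[InvokerState]): Option[(InvokerState, Boolean)] =
  usable.find(_.freeMemoryMB >= neededMB) match {
    case Some(invoker) => Some((invoker, false))
    case None if usable.nonEmpty =>
      // Overload: no invoker has enough free capacity. Pick a random
      // usable invoker anyway; its book-keeping semaphore drops below
      // zero, recording the over-commitment.
      Some((usable(Random.nextInt(usable.size)), true))
    case None => None
  }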

Did I miss other conditions that cause overloaded invokers by design?

I suspect that there are additional causes for overloaded invokers - 
design flaws in the controller / load balancer or even bugs. I suggest 
extending log messages and improving existing metrics / introducing new 
metrics to better understand what's going on with overloaded invokers. 
We need to be careful when extending log messages - we must neither 
considerably increase the log volume nor impact performance due to 
additional operations. Overall, the goal is to eventually fix invoker 
overloads.

I already opened a pull request (together with Sugandha) that adds the 
action timeout limit to the invoker assignment message in the load 
balancer, plus some other changes [2]. Please check the PR for details.

I'm planning further pull requests in these areas:

* At times, we see a non-negligible number of forced acknowledgements in 
large OpenWhisk installations. For this reason, I suggest extending the 
log messages in the processCompletion() method [3] in the common load 
balancer code to provide more diagnostic information when forced 
acknowledgements occur. In particular, I want to add more information 
about the invoker processing the activation and about the action itself. 
A metric reflecting forced acknowledgements also seems helpful. (See the 
first sketch after this list.)

* As discussed above, the load balancer will schedule an activation to a 
randomly selected usable invoker if it cannot find a usable invoker that 
has enough free user memory ("overload"). This can also be caused by 
fragmentation where all invokers are running activations with small memory 
limits and the activation to be scheduled has a very high memory limit. 
Even though the invoker pool may have plenty of free user memory in total, 
no single invoker may have enough free memory to fit a large activation.

For this reason, I'm planning to extend the schedule() method [4] in the 
ShardingContainerPoolBalancer to collect more information about scheduling 
that is logged afterwards: how many invokers were visited? Which minimum, 
average and maximum free memory did the usable invokers have that were not 
selected? (See the second sketch after this list.)

* When the "Rescheduling Run message" log entry [1] occurs on an invoker, 
we don't know what's currently going on in its busy and free container 
pools. I'm planning to extend the log message with more detailed 
information about the pools to better understand the scheduling history of 
this invoker. We need to understand which activations currently occupy the 
invoker. (See the third sketch after this list.)
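
For the first item above, a hedged sketch of the kind of diagnostics I 
have in mind for the forced-acknowledgement path - the counter and field 
names are assumptions, not existing OpenWhisk API; the real change would 
use the existing logging and metrics facilities:

import java.util.concurrent.atomic.AtomicLong

object ForcedAckDiagnostics {
  // Stand-in for a proper metrics counter emitted via the existing
  // metrics infrastructure.
  val forcedAcks = new AtomicLong(0)

  def reportForcedAck(invoker: String, namespace: String, action: String, memoryMB: Long): Unit = {
    val count = forcedAcks.incrementAndGet()
    // More context than today: which invoker was processing the
    // activation and which action was affected.
    println(s"forced acknowledgement #$count: invoker=$invoker, " +
      s"namespace=$namespace, action=$action, memory=${memoryMB}MB")
  }
}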
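
For the second item, a small self-contained example of the fragmentation 
effect together with the scheduling statistics I'd like to log; all 
numbers and names are made up:

object FragmentationExample extends App {
  final case class ScheduleStats(visited: Int, minFreeMB: Long, avgFreeMB: Long, maxFreeMB: Long)

  // Free memory per visited usable invoker: 960 MB free in total, yet a
  // 512 MB activation fits on no single invoker - the pool is fragmented.
  val freePerInvokerMB = Vector(192L, 256L, 128L, 224L, 160L)
  val neededMB = 512L

  val fits = freePerInvokerMB.exists(_ >= neededMB) // false
  val stats = ScheduleStats(
    visited = freePerInvokerMB.size,
    minFreeMB = freePerInvokerMB.min,
    avgFreeMB = freePerInvokerMB.sum / freePerInvokerMB.size,
    maxFreeMB = freePerInvokerMB.max)

  println(s"fits on a single invoker: $fits, " +
    s"total free: ${freePerInvokerMB.sum} MB, stats: $stats")
}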
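
And for the third item, a hedged sketch of a compact pool summary that 
could be appended to the "Rescheduling Run message" log entry. The data 
structures are illustrative, not the invoker's actual ContainerPool types:

final case class PoolEntry(namespace: String, action: String, memoryMB: Long)

// Aggregate a container pool into one short, log-friendly line:
// "<namespace>/<action> xN (total MB)" per distinct action, e.g.
// poolSummary(Seq(PoolEntry("ns", "pkg/action", 256))) == "ns/pkg/action x1 (256 MB)"
def poolSummary(pool: Seq[PoolEntry]): String =
  pool
    .groupBy(e => (e.namespace, e.action))
    .map { case ((ns, act), entries) =>
      s"$ns/$act x${entries.size} (${entries.map(_.memoryMB).sum} MB)"
    }
    .mkString("; ")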

Please let me know what you think.


Mit freundlichen Grüßen / Regards,

Sven Lange-Last
Senior Software Engineer
IBM Cloud Functions
Apache OpenWhisk

