openwhisk-dev mailing list archives

From Tyson Norris
Subject Re: Proposal on a future architecture of OpenWhisk
Date Thu, 16 Aug 2018 21:13:53 GMT
Thinking more about the singleton aspect, I guess this is mostly an issue for blackbox containers,
whereas manifest/managed containers will mitigate at least some of the singleton failure delays
via prewarm/stemcell containers.

So in the case of singleton failure, impacts would be:
- managed containers once prewarms are exhausted (may be improved by being more intelligent
about prewarm pool sizing based on load etc)
- managed containers that don’t match any prewarms (similar - if the prewarm pool is dynamically
configured based on load, this is less of a problem)
- blackbox containers (no help)
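
The three cases above can be sketched as a small decision function. This is only an illustration under stated assumptions: `Launch` and `launch_during_singleton_outage` are hypothetical names (not OpenWhisk APIs), and the prewarm pool is modeled as a simple per-kind counter.

```python
from collections import Counter
from enum import Enum


class Launch(Enum):
    PREWARM = "served from prewarm pool"
    BLOCKED = "blocked until singleton recovers"


def launch_during_singleton_outage(kind, prewarms, blackbox=False):
    """Decide how a cold start fares while the singleton is down.

    kind     -- runtime kind of the action, e.g. "nodejs:8" (illustrative)
    prewarms -- Counter of free prewarm containers per kind (hypothetical model)
    blackbox -- True for blackbox actions, which have no prewarms
    """
    if blackbox:
        return Launch.BLOCKED      # no help from prewarms
    if prewarms[kind] > 0:
        prewarms[kind] -= 1        # consume one prewarm container
        return Launch.PREWARM
    return Launch.BLOCKED          # pool exhausted, or no matching prewarm
```

With a pool of `Counter({"nodejs:8": 1})`, the first `nodejs:8` launch is served from the pool and the second is blocked, which is the "prewarms exhausted" case.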

If the failover of the singleton takes too long (I think the time will depend on cluster size; the
oldest node becomes the singleton host, iirc), I think we need to consider how containers can launch
in the meantime. A first step might be to test the singleton behavior in clusters of
various sizes.
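
To make the "oldest node becomes the singleton host" assumption concrete before testing it against a real cluster, here is a toy model. `Member` and `singleton_host` are hypothetical names, the join timestamp is an assumed stand-in for member age, and the model says nothing about how long the real handover takes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    address: str
    joined_at: float   # hypothetical join time; smaller means older


def singleton_host(members):
    """Return the oldest member, or None if the cluster is empty."""
    return min(members, key=lambda m: m.joined_at, default=None)
```

Removing the current host from the member list and calling `singleton_host` again yields the next-oldest node, which is the handover whose duration would need to be measured.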

> On Aug 16, 2018, at 11:01 AM, Tyson Norris wrote:
> A couple comments on singleton:
> - use of a cluster singleton will introduce a new single point of failure - the time from
singleton node failure to singleton resurrection on a different instance will be an outage
from the point of view of any ContainerRouter that does not already have a warm+free container
to service an activation
> - resurrecting the singleton will require transferring or rebuilding the state when recovery
occurs - in my experience this was tricky, and required replicating the data (which will be
slightly stale, but better than rebuilding from nothing); I don’t recall the handover delay
(to transfer the singleton to a new akka cluster node) when I last tried, but I think it was not
as fast as I hoped it would be.
> I don’t have a great suggestion for the singleton failure case, but would like to consider
this carefully, and discuss the ramifications (which may or may not be tolerable) before pursuing
this particular aspect of the design.
> On prioritization:
> - if concurrency is enabled for an action, this is another prioritization aspect, of
sorts - if the action supports concurrency, there is no reason (except for destruction coordination…)
that its container cannot be shared across shards. This could be added later, but may be worth considering
now, since there is a general reuse problem where a series of activations that arrive at different
ContainerRouters will create a new container in each, while a single container could be reused (avoiding
the creation of new containers) if concurrency is tolerated in that container. This would only (ha
ha) require changing how container destroy works, so that a container cannot be destroyed until the last
ContainerRouter is done with it. And if container destruction is coordinated in this way to
increase reuse, it would also be good to coordinate construction (don’t concurrently construct
the same container for multiple ContainerRouters IFF a single container would enable concurrent
activations once it is created). I’m not sure if others want this level of container
reuse, but if so, it would be worth considering these aspects (sharding/isolation vs sharing/coordination)
as part of any redesign.
> Thanks
> Tyson
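
The coordinated destroy/construct idea from the prioritization comment can be sketched as a registry that records which routers hold a container. This is a minimal single-process illustration, not OpenWhisk code: `SharedContainerRegistry`, `create`, and `destroy` are hypothetical, and a real design would need this coordination to work across ContainerRouter processes.

```python
import threading


class SharedContainerRegistry:
    """Tracks which routers hold a concurrency-capable container."""

    def __init__(self, create, destroy):
        self._create = create      # hypothetical factory: action -> container
        self._destroy = destroy    # hypothetical teardown: container -> None
        self._lock = threading.Lock()
        self._entries = {}         # action -> (container, set of router ids)

    def acquire(self, action, router_id):
        # Creating under the lock means two routers never construct the
        # same container concurrently (coarse: it serializes all creation).
        with self._lock:
            if action not in self._entries:
                self._entries[action] = (self._create(action), set())
            container, holders = self._entries[action]
            holders.add(router_id)
            return container

    def release(self, action, router_id):
        # Destroy only when the last router is done with the container.
        # Returns True if the container was destroyed.
        with self._lock:
            container, holders = self._entries[action]
            holders.discard(router_id)
            if not holders:
                del self._entries[action]
                self._destroy(container)
                return True
            return False
```

Two routers that call `acquire` for the same action get the same container, and it is torn down only after both have called `release`.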
> On Aug 15, 2018, at 8:55 AM, Carlos Santana wrote:
> I think we should add a section on prioritization for blocking vs. async
> invokes (non-blocking actions and triggers)
> The front door has the luxury of knowing some intent from the incoming
> request. I feel it would make sense to give high priority to blocking invokes,
> while async invokes go straight to the queue to be picked up by the system to
> eventually run, even if it takes 10 times longer to execute than a blocking
> invoke; for example, a webaction would take 10ms vs. a DB trigger fire, or an
> async webhook, taking 100ms.
> Also the controller takes time to convert a trigger and process the rules,
> this is something that can also be taken out of the hot path.
> So I'm just saying we could optimize the system because we know if the
> incoming request is a hot or hotter path :-)
> -- Carlos
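
The blocking-vs-async prioritization suggested above could look roughly like the following queue. It is a sketch only: `InvokeQueue` and the two priority levels are assumptions for illustration, and strict priority like this could starve async work under sustained blocking load, so a real design would likely need aging or a reserved share.

```python
import heapq
import itertools

BLOCKING, ASYNC = 0, 1   # lower value is served first


class InvokeQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker: FIFO within a priority

    def put(self, request, blocking):
        priority = BLOCKING if blocking else ASYNC
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def get(self):
        # Pop the highest-priority (then oldest) request.
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Blocking invokes enqueued after trigger fires are still dequeued first, while requests of the same kind keep their arrival order.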
