From: Markus Thömmes
Date: Sun, 19 Aug 2018 12:59:28 +0200
Subject: Re: Proposal on a future architecture of OpenWhisk
To: dev@openwhisk.apache.org

Hi Tyson,

On Fri, Aug 17, 2018 at 23:45, Tyson Norris wrote:

> If the failover of the singleton is too long (I think it will be based on
> cluster size, oldest node becomes the singleton host iirc), I think we need
> to consider how containers can launch in the meantime. A first step might
> be to test out the singleton behavior in clusters of various sizes.
>
> > I agree this bit of design is crucial, a few thoughts:
> > Pre-warm wouldn't help here, the ContainerRouters only know warm
> > containers. Pre-warming is managed by the ContainerManager.
>
> Ah right
>
> > Considering a fail-over scenario: We could consider sharing the state via
> > EventSourcing. That is: all state lives inside of frequently snapshotted
> > events and thus can be shared between multiple instances of the
> > ContainerManager seamlessly. Alternatively, we could also think about only
> > working on persisted state. That way, a cold-standby model could fly. We
> > should make sure that the state is not "slightly stale" but rather that
> > both instances see the same state at any point in time. I believe that on
> > the cold path of generating new containers we can live with the extra
> > latency of persisting what we're doing, as the path will still be
> > dominated by the container creation latency.
>
> Wasn't clear if you mean not using ClusterSingleton? To be clear, in the
> ClusterSingleton case there are 2 issues:
> - the time it takes for the akka ClusterSingletonManager to realize it
>   needs to start a new actor
> - the time it takes for the new actor to assume a usable state
>
> EventSourcing (or external persistence) may help with the latter, but we
> will need to be sure the former is tolerable to start with.
> Here is an example test from the akka source that may be useful (multi-jvm,
> but all local):
>
> https://github.com/akka/akka/blob/009214ae07708e8144a279e71d06c4a504907e31/akka-cluster-tools/src/multi-jvm/scala/akka/cluster/singleton/ClusterSingletonManagerChaosSpec.scala
>
> Some things to consider, that I don't know the details of:
> - will the size of the cluster affect the singleton behavior in case of
>   failure? (I think so, but I'm not sure to what extent); in the simple
>   test above it takes ~6s for the replacement singleton to begin startup,
>   but if we have 100s of nodes, I'm not sure how much time it will take.
>   (I don't think this should be hard to test, but I haven't done it)
> - in case of a hard crash, what is the singleton behavior? In graceful jvm
>   termination, I know the cluster behavior is good, but there is always
>   this question about how downing nodes will be handled. If this critical
>   piece of the system relies on akka cluster functionality, we will need
>   to make sure that the singleton can be reconstituted, both in case of
>   graceful termination (restart/deployment events) and non-graceful
>   termination (hard vm crash, hard container crash). This is ignoring the
>   more complicated cases of extended network partitions, which will also
>   have bad effects on many of the downstream systems.

I don't think we need to be eager to consider akka-cluster to be set in
stone here. The singleton in my mind doesn't need to be clustered at all.
Say we have fully shared state through persistence or event-sourcing and a
hot-standby model: couldn't we implement the fallback through routing in
front of the active/passive ContainerManager pair? Once one goes
unreachable, fall back to the other.
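To make that concrete, here is a minimal sketch of what an event-sourced
ContainerManager could look like, assuming akka-persistence; all command,
event and state names are hypothetical, not from the current code base:

import akka.persistence.{PersistentActor, SnapshotOffer}

// Hypothetical commands, events and state, purely illustrative.
final case class CreateContainer(action: String)
final case class ContainerCreated(action: String, containerId: String)
final case class ManagerState(containers: Map[String, Set[String]] = Map.empty) {
  def updated(e: ContainerCreated): ManagerState =
    copy(containers = containers.updated(
      e.action, containers.getOrElse(e.action, Set.empty) + e.containerId))
}

class ContainerManager extends PersistentActor {
  override val persistenceId = "container-manager"

  private var state = ManagerState()

  // A standby instance rebuilds exactly this state by replaying the journal.
  override def receiveRecover: Receive = {
    case e: ContainerCreated               => state = state.updated(e)
    case SnapshotOffer(_, s: ManagerState) => state = s
  }

  override def receiveCommand: Receive = {
    case CreateContainer(action) =>
      val event = ContainerCreated(action, java.util.UUID.randomUUID().toString)
      // Persist the decision before acting on it: the extra write sits on
      // the cold path and is dominated by container start-up time anyway.
      persist(event) { e =>
        state = state.updated(e)
        // ... actually schedule the container creation here ...
      }
  }
}

With both instances reading the same journal, failover amounts to pointing
the routing layer at the standby.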
> > Handover time as you say is crucial, but I'd say as it only impacts
> > container creation, we could live with, let's say, 5 seconds of
> > failover-downtime on this path? What's your experience been on singleton
> > failover? How long did it take?
>
> Seconds in the simplest case, so I think we need to test it in a scaled
> case (100s of cluster nodes), as well as the hard crash case (where not
> downing the node may affect the cluster state).
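For reference, the wiring under test would be the standard
akka-cluster-tools setup, roughly as below (classic API; ContainerManager
is the hypothetical actor sketched above). How quickly the proxy reaches a
replacement singleton after a node loss is governed by the cluster's
failure-detection and downing settings (e.g. akka.cluster.down-removal-margin),
which is exactly what would need measuring at various cluster sizes:

import akka.actor.{ActorSystem, PoisonPill, Props}
import akka.cluster.singleton.{
  ClusterSingletonManager, ClusterSingletonManagerSettings,
  ClusterSingletonProxy, ClusterSingletonProxySettings
}

val system = ActorSystem("whisk")

// Started on every node; only the oldest cluster member hosts the actor.
system.actorOf(
  ClusterSingletonManager.props(
    singletonProps = Props(new ContainerManager),
    terminationMessage = PoisonPill,
    settings = ClusterSingletonManagerSettings(system)),
  name = "containerManager")

// ContainerRouters talk to the singleton through a proxy that tracks which
// node currently hosts it and buffers messages during hand-over.
val manager = system.actorOf(
  ClusterSingletonProxy.props(
    singletonManagerPath = "/user/containerManager",
    settings = ClusterSingletonProxySettings(system)),
  name = "containerManagerProxy")

Measuring the gap between killing the oldest node (gracefully and hard) and
the proxy delivering to the new singleton would answer both questions above.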
> On Aug 16, 2018, at 11:01 AM, Tyson Norris wrote:
>
> A couple comments on singleton:
> - use of cluster singleton will introduce a new single point of failure
> - from the time of singleton node failure to singleton resurrection on a
>   different instance, there will be an outage from the point of view of
>   any ContainerRouter that does not already have a warm+free container to
>   service an activation
> - resurrecting the singleton will require transferring or rebuilding the
>   state when recovery occurs - in my experience this was tricky, and
>   requires replicating the data (which will be slightly stale, but better
>   than rebuilding from nothing); I don't recall the handover delay (to
>   transfer the singleton to a new akka cluster node) when I tried last,
>   but I think it was not as fast as I hoped it would be.
>
> I don't have a great suggestion for the singleton failure case, but would
> like to consider this carefully, and discuss the ramifications (which may
> or may not be tolerable) before pursuing this particular aspect of the
> design.
>
> On prioritization:
> - if concurrency is enabled for an action, this is another prioritization
>   aspect, of sorts - if the action supports concurrency, there is no
>   reason (except for destruction coordination…) that it cannot be shared
>   across shards. This could be added later, but may be worth considering,
>   since there is a general reuse problem where a series of activations
>   that arrives at different ContainerRouters will create a new container
>   in each, while they could be reused (and avoid creating new containers)
>   if concurrency is tolerated in that container. This would only (ha ha)
>   require changing how container destroy works, where it cannot be
>   destroyed until the last ContainerRouter is done with it. And if
>   container destruction is coordinated in this way to increase reuse, it
>   would also be good to coordinate construction (don't concurrently
>   construct the same container for multiple ContainerRouters IFF a single
>   container would enable concurrent activations once it is created). I'm
>   not sure if others are desiring this level of container reuse, but if
>   so, it would be worth considering these aspects (sharding/isolation vs
>   sharing/coordination) as part of any redesign.

> > Yes, I can see where you're heading here. I think this can be
> > generalized:
> >
> > Assume intra-container concurrency C and number of ContainerRouters R.
> > If C > R: Shard the "slots" on this container evenly across R. The
> > container can only be destroyed after you receive R acknowledgements of
> > doing so.
> > If C < R: Hand out 1 slot to C Routers, point the remaining Routers to
> > the ones that got slots.
>
> Yes, mostly - I think there is also a case where the destruction message
> is revoked by the same router (receiving a new activation for the
> container which it previously requested destruction of). But I think this
> is covered in the details of tracking "after you receive R acks of
> destructions".

Hm, I don't think that case exists. Once a Router has acknowledged a
revoke, it will remove the container from its state immediately. It will
therefore never revoke that acknowledgement, but rather request a new
resource if it finds it now has insufficient resources. Again, this might
be my fault for not providing sequence diagrams for the algorithms I'm
describing here.
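Spelling out the sharding rule above as a small sketch (the function and
all names are hypothetical, not from the code base):

// Distribute the C concurrency slots of one container across R
// ContainerRouters, following the C > R / C < R rule.
def shardSlots(c: Int, routers: Vector[String]): Map[String, Int] = {
  val r = routers.size
  if (c >= r) {
    // C >= R: every Router gets an even share; the remainder is spread
    // one extra slot at a time over the first few Routers.
    val (base, rest) = (c / r, c % r)
    routers.zipWithIndex.map { case (router, i) =>
      router -> (base + (if (i < rest) 1 else 0))
    }.toMap
  } else {
    // C < R: the first C Routers get one slot each; Routers left with 0
    // slots are pointed at the slot holders instead.
    routers.zipWithIndex.map { case (router, i) =>
      router -> (if (i < c) 1 else 0)
    }.toMap
  }
}

// Example: C = 5 and Routers r0..r2 yields Map(r0 -> 2, r1 -> 2, r2 -> 1);
// the container is destroyed only after every slot-holding Router has
// acknowledged the revoke.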
> > Concurrent creation: Batch creation requests while one container is
> > being created. Say you received a request for a new container that has C
> > slots. If there are more requests for that container arriving while it
> > is being created, don't act on them and fold the creation into the first
> > one. Only start creating a new container if the number of resource
> > requests exceeds C.
> >
> > Does that make sense? I think in that model you can set C=1 and it works
> > as I envisioned it to work, or set it to C=200 and things will be shared
> > even across routers.
>
> Side note: One detail about the pending concurrency impl today is that due
> to the async nature of tracking the active activations within the
> container, there is no guarantee (when C>1) that the number is exact, so
> if you specify C=200, you may actually get a different container at 195 or
> 205. This is not really related to this discussion, but is based on the
> current messaging/future behavior in ContainerPool/ContainerProxy, so I
> wanted to mention it explicitly, in case it matters to anyone.

Not relevant for this discussion, but: I think that is not the right way of
approaching it. If you handle the concurrency metric asynchronously, you
effectively allow an unbounded number of requests to reach your container
(in reality bounded by the number of CPU cores and the real concurrency
that happens in your system). I think the ContainerPool should track these
numbers consistently as well, to be able to guarantee there are never more
than C requests on a container. I believe that is crucial, especially for
the C=1 case, but it might be relevant for other cases as well. The whole
point of this design is to accurately track that concurrency metric.

> Thanks
> Tyson
>
> WDYT?
>
> Thanks
> Tyson
>
> On Aug 15, 2018, at 8:55 AM, Carlos Santana <csantana23@gmail.com> wrote:
>
> I think we should add a section on prioritization for blocking vs. async
> invokes (non-blocking actions and triggers)
>
> The front door has the luxury of knowing some intent from the incoming
> request. I feel it would make sense to give high priority to blocking
> invokes, and for async they go straight to the queue to be picked up by
> the system to eventually run, even if it takes 10 times longer to execute
> than a blocking invoke - for example a webaction would take 10ms vs. a DB
> trigger fire, or an async webhook takes 100ms.
>
> Also the controller takes time to convert a trigger and process the rules;
> this is something that can also be taken out of the hot path.
>
> So I'm just saying we could optimize the system because we know if the
> incoming request is on a hot or hotter path :-)
>
> -- Carlos
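Reading Carlos's suggestion as code: a sketch of a two-lane intake where
blocking invokes take priority over async work. All names here are
hypothetical, and a real implementation would also need fairness or
anti-starvation handling for the async lane:

import scala.collection.mutable

sealed trait Lane
case object Blocking extends Lane // blocking invokes, e.g. web actions
case object Async extends Lane    // async invokes, trigger fires, webhooks

final case class Invocation(action: String, lane: Lane)

final class IntakeQueue {
  private val hot  = mutable.Queue.empty[Invocation]
  private val cold = mutable.Queue.empty[Invocation]

  def offer(inv: Invocation): Unit = inv.lane match {
    case Blocking => hot.enqueue(inv)
    case Async    => cold.enqueue(inv)
  }

  // Drain the hot lane first; async work is picked up "eventually", which
  // is acceptable even if it takes several times longer to get scheduled.
  def poll(): Option[Invocation] =
    if (hot.nonEmpty) Some(hot.dequeue())
    else if (cold.nonEmpty) Some(cold.dequeue())
    else None
}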