Subject: Re: Resource Management/Locking [was: Re: What would be your ideal solution?]
Date: Mon, 25 Nov 2013 10:16:49 -0700
From: Darren Shepherd <darren.s.shepherd@gmail.com>
To: John Burwell
Cc: "dev@cloudstack.apache.org"

You bring up some interesting points. I really need to digest this further. At a high level I think I agree, but there are a lot of implied details in what you've said.

Darren

On Mon, Nov 25, 2013 at 8:39 AM, John Burwell wrote:

> Darren,
>
> I originally presented my thoughts on this subject at CCC13 [1]. Fundamentally, I see CloudStack as having two distinct tiers: orchestration management and automation control. The orchestration tier coordinates the automation control layer to fulfill user goals (e.g. create a VM instance, alter a network route, snapshot a volume, etc.) constrained by policies defined by the operator (e.g. multi-tenancy boundaries, ACLs, quotas, etc.). This layer must always be available to take new requests and to report the best available infrastructure state information. Since execution of work is guaranteed once a request is accepted, this layer may pend work to be completed when the appropriate devices become available.
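>
> Purely to make that hand-off concrete, here is a minimal sketch (these types are hypothetical and illustrative only, not existing CloudStack classes):
>
> import java.util.concurrent.CompletableFuture;
>
> // Illustrative sketch only -- hypothetical types, not existing CloudStack code.
> final class UnitOfWork {
>     final String resourceId;  // e.g. a VM, volume, or network identifier
>     final String action;      // e.g. "create", "destroy", "snapshot"
>     UnitOfWork(String resourceId, String action) {
>         this.resourceId = resourceId;
>         this.action = action;
>     }
> }
>
> interface AutomationControl {
>     // The orchestration tier submits work and gets an immediate acknowledgement;
>     // the automation tier executes it asynchronously, serialized per resource,
>     // pending it until the underlying device is available.
>     CompletableFuture<Boolean> submit(UnitOfWork work);
> }
>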
> The automation control tier translates logical units of work to underlying infrastructure component APIs. Upon completion of a unit of work's execution, the state of a device (e.g. hypervisor, storage device, network switch, router, etc.) matches the state managed by the orchestration tier at the time the unit of work was created. In order to ensure that the state of the underlying devices remains consistent, these units of work must be executed serially. Permitting concurrent changes to resources creates race conditions that lead to resource overcommitment and state divergence. One symptom of this phenomenon is the myriad of scripts operators write to "synchronize" state between the CloudStack database and their hypervisors. Another is the rapid create-destroy example provided below, which can (and often does) leave dangling resources due to a race condition between the two operations.
>
> In order to provide reliability, CloudStack vertically partitions the infrastructure into zones (an independent power source/network uplink combination) sub-divided into pods (racks). At this time, regions are largely notional and, as such, are not partitions. Between the user's zone selection and our allocators' distribution of resources across pods, the system attempts to distribute resources as widely as possible across these partitions to provide resilience against a variety of infrastructure failures (e.g. power loss, network uplink disruption, switch failures, etc.). In order to maximize this resilience, the control plane (orchestration + automation tiers) must be able to operate on all available partitions. For example, if we have two (2) zones (A & B) and twenty (20) pods per zone, we should be able to take and execute work in Zone A when one or more of its pods is lost, and to keep taking and executing work in Zone A when Zone B has failed.
>
> CloudStack is an eventually consistent system in that the state reflected in the orchestration tier will (optimistically) differ from the state of the underlying infrastructure (managed by the automation tier). Furthermore, the system has a partitioning model to provide resilience in the face of a variety of logical and physical failures. However, the automation control tier requires strictly consistent operations. Based on these definitions, the system appears to violate the CAP theorem [2] (Brewer!). The separation of the system into two distinct tiers isolates these characteristics, but the boundary between them must be carefully implemented to ensure that the consistency requirements of the automation tier are not leaked to the orchestration tier.
>
> To properly implement this boundary, I think we should split the orchestration and automation control tiers into separate physical processes communicating via an RPC mechanism, allowing the automation control tier to completely encapsulate its work distribution model. In my mind, the tricky wicket is providing serialization and partition tolerance in the automation control tier. Realistically, there are two options: explicit and implicit locking models.
>
> Explicit locking models employ an external coordination mechanism to provide exclusive access to resources (e.g. the RDBMS lock pattern, ZooKeeper, Hazelcast, etc.). The challenge with this model is ensuring the availability of the locking mechanism in the face of a partition, forcing CloudStack operators to ensure that they have deployed the underlying mechanism in a partition-tolerant manner (e.g. don't locate all of the replicas in the same pod, deploy a cluster per zone, etc.). Additionally, the durability introduced by these mechanisms inhibits self-healing due to lock staleness.
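>
> As a rough illustration of the explicit model, here is a sketch using Apache Curator's InterProcessMutex recipe against ZooKeeper (the lock path layout and the timeout are assumptions for the example, not a proposal):
>
> import java.util.concurrent.TimeUnit;
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.recipes.locks.InterProcessMutex;
>
> // Explicit, externally coordinated locking. The ZooKeeper ensemble itself must be
> // deployed in a partition-tolerant way -- the operational burden described above.
> final class ExplicitVmLock {
>     private final CuratorFramework client;
>
>     ExplicitVmLock(CuratorFramework client) { this.client = client; }
>
>     void executeExclusively(String vmId, Runnable unitOfWork) throws Exception {
>         InterProcessMutex lock = new InterProcessMutex(client, "/cloudstack/locks/vm/" + vmId);
>         if (!lock.acquire(30, TimeUnit.SECONDS)) {  // bounded wait so work does not pile up
>             throw new IllegalStateException("Could not lock VM " + vmId);
>         }
>         try {
>             unitOfWork.run();                       // exclusive access while the lock is held
>         } finally {
>             lock.release();                         // a crashed holder is cleaned up on session expiry
>         }
>     }
> }
>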
> In contrast, an implicit lock model structures the runtime execution model to provide exclusive access to a resource and to model the partitioning scheme. One such model provides a single work queue (mailbox) and consuming process (actor) per resource. The orchestration tier provides a description of the partition and resource definitions to the automation control tier. The automation control tier creates a supervisor per partition, which in turn manages process creation per resource. Therefore, process creation and destruction create an implicit lock. Since the automation control tier does not persist data in this model, the crash of a supervisor and/or process (supervisors are simply specialized processes) releases the implicit lock and signals a re-execution of the supervisor/process allocation process. The following high-level process describes creation and allocation (it hand-waves certain details such as back pressure and throttling):
>
> 1. The automation control layer receives a resource definition (e.g. zone description, VM definition, volume information, etc.). These requests are processed by the owning partition supervisor exclusively, in order of receipt. Therefore, the automation control tier views the world as a tree of partitions and resources.
> 2. The partition supervisor creates the process (and the associated mailbox), providing it with the initial state. The process state is Initialized.
> 3. The process synchronizes the state of the underlying resource with the state provided. Upon successful completion of state synchronization, the state of the process becomes Ready; only Ready processes can consume units of work from their mailboxes. If synchronization fails, the process crashes. All state transitions and crashes are reported to interested parties through an asynchronous event reporting mechanism, including the id of the unit of work the device represents.
>
> The Ready state means that the underlying device is in a usable state consistent with the last unit of work executed. A process crashes when it is unable to bring the device into a state consistent with the unit of work being executed (a process crash also destroys the associated mailbox, flushing pending work). This event initiates execution of the allocation process (above) until the process can be re-allocated in a Ready state (again, throttling is hand-waved for brevity). The state synchronization step converges the actual state of the device with changes that occurred during its unavailability. When a unit of work fails to be executed, the orchestration tier determines the appropriate recovery strategy (e.g. re-allocate the work to another resource, wait for the availability of the resource, fail the operation, etc.).
>
> The association of one process per resource provides exclusive access to the resource without the requirement of an external locking mechanism. A mailbox per process orders pending units of work. Together, they provide serialization of operation execution. In the example provided, a unit of work would be submitted to create a VM and a second unit of work would be submitted to destroy it. The creation would be completely executed, followed by the destruction (assuming no failures); therefore, the VM would briefly exist before being destroyed. In conjunction with a process location mechanism, the system can place the processes associated with resources in the appropriate partition, allowing the system to properly self-heal, manage its own scalability (thinking lightweight system VMs), and systematically enforce partition tolerance (the operator was nice enough to describe their infrastructure; we should use it to ensure the resilience of both CloudStack and that infrastructure).
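>
> As a rough sketch of that shape using Akka's classic actor Java API (the string-based message protocol and the names are mine, purely for illustration):
>
> import akka.actor.AbstractActor;
> import akka.actor.ActorRef;
> import akka.actor.ActorSystem;
> import akka.actor.Props;
>
> // One actor (process + mailbox) per VM: messages are consumed one at a time, in
> // arrival order, so the actor itself acts as the implicit lock on the resource.
> class VmActor extends AbstractActor {
>     private final String vmId;
>
>     VmActor(String vmId) { this.vmId = vmId; }
>
>     static Props props(String vmId) {
>         return Props.create(VmActor.class, () -> new VmActor(vmId));
>     }
>
>     @Override
>     public Receive createReceive() {
>         return receiveBuilder()
>             .matchEquals("create",  m -> { /* drive the hypervisor to create vmId */ })
>             .matchEquals("destroy", m -> { /* drive the hypervisor to destroy vmId */ })
>             .build();
>     }
> }
>
> class AutomationControlExample {
>     public static void main(String[] args) {
>         ActorSystem system = ActorSystem.create("automation-control");
>         ActorRef vm1 = system.actorOf(VmActor.props("vm-1"), "vm-1");
>         // The rapid create-then-destroy case: two messages in one mailbox,
>         // executed serially, so the two operations cannot race.
>         vm1.tell("create", ActorRef.noSender());
>         vm1.tell("destroy", ActorRef.noSender());
>     }
> }
>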
> Until relatively recently, the implicit locking model described was infeasible on the JVM. Using native Java threads, a server would be limited to controlling (at best) a few hundred resources. However, lightweight threading models implemented by libraries/frameworks such as Akka [3], Quasar [4], and Erjang [5] can scale to millions of "threads" on reasonably sized servers and provide the supervisor/actor/mailbox abstractions described above. Most importantly, this approach does not require operators to become operationally knowledgeable about yet another platform/component. In short, I believe we can encapsulate these requirements in the management server (orchestration + automation control tiers), keeping the operational footprint of the system proportional to the deployment without sacrificing resilience. Finally, it provides the foundation for proper collection of instrumentation information and for process control/monitoring across data centers.
>
> Admittedly, I have hand-waved some significant issues that would need to be resolved. I believe they are all resolvable, but it would take discussion to determine the best approach to them. Transforming CloudStack to such a model would not be trivial, but I believe it would be worth the (significant) effort, as it would make CloudStack one of the most scalable and resilient cloud orchestration/management platforms available.
>
> Thanks,
> -John
>
> [1]: http://www.slideshare.net/JohnBurwell1/how-to-run-from-a-zombie-cloud-stack-distributed-process-management
> [2]: http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
> [3]: http://akka.io
> [4]: https://github.com/puniverse/quasar
> [5]: https://github.com/trifork/erjang/wiki
>
> P.S. I have CC'ed the developer mailing list. All conversations at this level of detail should be initiated and occur on the mailing list to ensure transparency with the community.
>
> On Nov 22, 2013, at 3:49 PM, Darren Shepherd wrote:
>
> 140 characters are not productive.
>
> What would be your ideal way to do distributed concurrency control? Simple use case: Server 1 receives a request to start VM 1. Server 2 receives a request to delete VM 1. What do you do?
>
> Darren