From: Edison Su <Edison.su@citrix.com>
To: dev@cloudstack.apache.org, John Burwell
Subject: RE: Resource Management/Locking [was: Re: What would be your ideal solution?]
Date: Mon, 25 Nov 2013 19:05:09 +0000

Won't the architecture used by Mesos/Omega solve the resource management/locking issue?

http://mesos.apache.org/documentation/latest/mesos-architecture/
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf

Basically, one server holds all of the resource information (CPU/memory/disk/IP addresses, etc.) about the whole data center in memory; all of the hypervisor hosts, or any other resource entities, connect to this server to report and update their own resources. As there is only one master server, the CAP theorem concerns don't come into play.
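To make that concrete, below is a minimal sketch of the single-master bookkeeping described above. The class and method names (ResourceMaster, report, tryReserve) are hypothetical and are not Mesos or CloudStack APIs; the point is only that one process owns the authoritative in-memory view and serializes allocation decisions against it.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: one master process holds the data-center resource view in
    // memory. Host agents call report() to publish capacity; allocation decisions are
    // made against the single authoritative copy, so no distributed locking is needed.
    public class ResourceMaster {

        static final class HostResources {
            int freeCpuCores;
            long freeMemoryMb;
            HostResources(int cpu, long memMb) { freeCpuCores = cpu; freeMemoryMb = memMb; }
        }

        private final Map<String, HostResources> hosts = new ConcurrentHashMap<>();

        // Called by each hypervisor host (or other resource entity) when it connects
        // or whenever its free capacity changes.
        public synchronized void report(String hostId, int freeCpuCores, long freeMemoryMb) {
            hosts.put(hostId, new HostResources(freeCpuCores, freeMemoryMb));
        }

        // A single, serialized decision point: reserve capacity on a host if available.
        public synchronized boolean tryReserve(String hostId, int cpuCores, long memoryMb) {
            HostResources r = hosts.get(hostId);
            if (r == null || r.freeCpuCores < cpuCores || r.freeMemoryMb < memoryMb) {
                return false;
            }
            r.freeCpuCores -= cpuCores;
            r.freeMemoryMb -= memoryMb;
            return true;
        }
    }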
> -----Original Message-----
> From: Darren Shepherd [mailto:darren.s.shepherd@gmail.com]
> Sent: Monday, November 25, 2013 9:17 AM
> To: John Burwell
> Cc: dev@cloudstack.apache.org
> Subject: Re: Resource Management/Locking [was: Re: What would be your ideal solution?]
>
> You bring up some interesting points. I really need to digest this further. From a high level I think I agree, but there are a lot of implied details in what you've said.
>
> Darren
>
>
> On Mon, Nov 25, 2013 at 8:39 AM, John Burwell wrote:
>
> > Darren,
> >
> > I originally presented my thoughts on this subject at CCC13 [1]. Fundamentally, I see CloudStack as having two distinct tiers - orchestration management and automation control. The orchestration tier coordinates the automation control layer to fulfill user goals (e.g. create a VM instance, alter a network route, snapshot a volume, etc.) constrained by policies defined by the operator (e.g. multi-tenancy boundaries, ACLs, quotas, etc.). This layer must always be available to take new requests and to report the best available infrastructure state information. Since execution of work is guaranteed on completion of a request, this layer may pend work to be completed when the appropriate devices become available.
> >
> > The automation control tier translates logical units of work to underlying infrastructure component APIs. Upon completion of a unit of work's execution, the state of a device (e.g. hypervisor, storage device, network switch, router, etc.) matches the state managed by the orchestration tier at the time the unit of work was created. In order to ensure that the state of the underlying devices remains consistent, these units of work must be executed serially. Permitting concurrent changes to resources creates race conditions that lead to resource overcommitment and state divergence. One symptom of this phenomenon is the myriad of scripts operators write to "synchronize" state between the CloudStack database and their hypervisors. Another is the rapid create-destroy example provided below, which can (and often does) leave dangling resources due to race conditions between the two operations.
> >
> > In order to provide reliability, CloudStack vertically partitions the infrastructure into zones (independent power source/network uplink combinations) sub-divided into pods (racks). Regions are largely notional at this time and, as such, are not partitions. Between the user's zone selection and our allocators' distribution of resources across pods, the system attempts to distribute resources as widely as possible across these partitions to provide resilience against a variety of infrastructure failures (e.g. power loss, network uplink disruption, switch failures, etc.). In order to maximize this resilience, the control plane (orchestration + automation tiers) must be able to operate on all available partitions. For example, if we have two (2) zones (A & B) and twenty (20) pods per zone, we should be able to take and execute work in Zone A when one or more of its pods is lost, as well as when Zone B has failed entirely.
> >
> > CloudStack is an eventually consistent system in that the state reflected in the orchestration tier will (optimistically) differ from the state of the underlying infrastructure (managed by the automation tier). Furthermore, the system has a partitioning model to provide resilience in the face of a variety of logical and physical failures. However, the automation control tier requires strictly consistent operations. Based on these definitions, the system appears to violate the CAP theorem [2] (Brewer!). The separation of the system into two distinct tiers isolates these characteristics, but the boundary between them must be carefully implemented to ensure that the consistency requirements of the automation tier are not leaked to the orchestration tier.
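As a rough illustration of that boundary (the type names below are invented for this sketch and do not exist in CloudStack), the orchestration tier could hand opaque units of work across an RPC boundary, and the automation control tier would own applying them to each device serially:

    // Illustrative only: these interfaces do not exist in CloudStack; they just name
    // the hand-off between the two tiers described above.
    public final class TierBoundary {

        // A logical unit of work produced by the orchestration tier.
        public interface UnitOfWork {
            String resourceId();                          // the device this work targets
            void applyTo(Object device) throws Exception; // translate to the device's API
        }

        // The automation control tier accepts work and applies it to each resource
        // serially, in order of receipt, so concurrent requests cannot race on a device.
        public interface AutomationControl {
            void submit(UnitOfWork work); // asynchronous hand-off; outcome is reported via events
        }

        // Events flowing back to the orchestration tier, which chooses the recovery
        // strategy (retry elsewhere, wait for the resource, or fail the operation).
        public enum WorkOutcome { COMPLETED, FAILED, RESOURCE_UNAVAILABLE }

        private TierBoundary() { }
    }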
> >
> > To properly implement this boundary, I think we should split the orchestration and automation control tiers into separate physical processes communicating via an RPC mechanism - allowing the automation control tier to completely encapsulate its work distribution model. In my mind, the tricky wicket is providing serialization and partition tolerance in the automation control tier. Realistically, there are two options - explicit and implicit locking models. Explicit locking models employ an external coordination mechanism to provide exclusive access to resources (e.g. the RDBMS lock pattern, ZooKeeper, Hazelcast, etc.). The challenge with this model is ensuring the availability of the locking mechanism in the face of partition - forcing CloudStack operators to ensure that they have deployed the underlying mechanism in a partition-tolerant manner (e.g. don't locate all of the replicas in the same pod, deploy a cluster per zone, etc.). Additionally, the durability introduced by these mechanisms inhibits self-healing due to lock staleness.
> >
> > In contrast, an implicit locking model structures the runtime execution model to provide exclusive access to a resource and to model the partitioning scheme. One such model is to provide a single work queue (mailbox) and consuming process (actor) per resource. The orchestration tier provides a description of the partition and resource definitions to the automation control tier. The automation control tier creates a supervisor per partition, which in turn manages process creation per resource. Therefore, process creation and destruction creates an implicit lock. Since the automation control tier does not persist data in this model, the crash of a supervisor and/or process (supervisors are simply specialized processes) releases the implicit lock and signals a re-execution of the supervisor/process allocation process. The following high-level process describes creation/allocation (hand waving certain details such as back pressure and throttling); a rough code sketch follows the list:
> >
> > 1. The automation control layer receives a resource definition (e.g. zone description, VM definition, volume information, etc.). These requests are processed by the owning partition supervisor exclusively, in order of receipt. Therefore, the automation control tier views the world as a tree of partitions and resources.
> > 2. The partition supervisor creates the process (and the associated mailbox), providing it with the initial state. The process state is Initialized.
> > 3. The process synchronizes the state of the underlying resource with the state provided. Upon successful completion of state synchronization, the state of the process becomes Ready. Only Ready processes can consume units of work from their mailboxes. If synchronization fails, the process crashes. All state transitions and crashes are reported to interested parties through an asynchronous event reporting mechanism, including the id of the unit of work the device represents.
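The sketch below is one possible shape for steps 1-3. It uses plain Java threads and the hypothetical UnitOfWork interface from the earlier sketch purely for readability; all names are invented, and a real implementation would sit on a lightweight-thread/actor runtime (discussed below) rather than one OS thread per resource. Being the sole consumer of a resource's mailbox is what makes the lock implicit.

    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical sketch of steps 1-3: one supervisor per partition, one process
    // (here a thread + mailbox) per resource. No external lock manager is involved.
    final class PartitionSupervisor {

        enum State { INITIALIZED, READY, CRASHED }

        static final class ResourceProcess implements Runnable {
            final String resourceId;
            final BlockingQueue<TierBoundary.UnitOfWork> mailbox = new LinkedBlockingQueue<>();
            volatile State state = State.INITIALIZED;

            ResourceProcess(String resourceId) { this.resourceId = resourceId; }

            @Override public void run() {
                try {
                    synchronizeDeviceState();     // step 3: converge the device with the provided state
                    state = State.READY;          // only Ready processes consume work
                    while (true) {
                        TierBoundary.UnitOfWork work = mailbox.take();
                        work.applyTo(resourceId); // real device handle elided in this sketch
                    }
                } catch (Exception crash) {
                    state = State.CRASHED;        // a crash flushes pending work with the mailbox
                    mailbox.clear();              // the supervisor would then re-run allocation
                }
            }

            void synchronizeDeviceState() { /* talk to the hypervisor/storage/network device here */ }
        }

        private final Map<String, ResourceProcess> processes = new ConcurrentHashMap<>();

        // Steps 1-2: resource definitions are handled in order of receipt; the supervisor
        // creates the process and its mailbox, providing the initial state (elided here).
        void onResourceDefinition(String resourceId) {
            processes.computeIfAbsent(resourceId, id -> {
                ResourceProcess p = new ResourceProcess(id);
                new Thread(p, "resource-" + id).start();
                return p;
            });
        }

        // Route a unit of work to the owning process's mailbox.
        void submit(TierBoundary.UnitOfWork work) {
            ResourceProcess p = processes.get(work.resourceId());
            if (p != null && p.state != State.CRASHED) {
                p.mailbox.add(work);
            }
            // else: emit a failure event so the orchestration tier can pick a recovery strategy
        }
    }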
> >
> > The Ready state means that the underlying device is in a usable state consistent with the last unit of work executed. A process crashes when it is unable to bring the device into a state consistent with the unit of work being executed (a process crash also destroys the associated mailbox, flushing pending work). This event initiates execution of the allocation process (above) until the process can be re-allocated in a Ready state (again, throttling is hand waved for the purposes of brevity). The state synchronization step converges the actual state of the device with changes that occurred during unavailability. When a unit of work fails to be executed, the orchestration tier determines the appropriate recovery strategy (e.g. re-allocate the work to another resource, wait for the availability of the resource, fail the operation, etc.).
> >
> > The association of one process per resource provides exclusive access to the resource without the requirement of an external locking mechanism. A mailbox per process orders pending units of work. Together, they provide serialization of operation execution. In the example provided, a unit of work would be submitted to create a VM and a second unit of work would be submitted to destroy it. The creation would be completely executed, followed by the destruction (assuming no failures). Therefore, the VM will briefly exist before being destroyed. In conjunction with a process location mechanism, the system can place the processes associated with resources in the appropriate partition, allowing the system to properly self-heal, manage its own scalability (thinking lightweight system VMs), and systematically enforce partition tolerance (the operator was nice enough to describe their infrastructure - we should use it to ensure the resilience of CloudStack and their infrastructure).
> >
> > Until relatively recently, the implicit locking model described was infeasible on the JVM. Using native Java threads, a server would be limited to controlling (at best) a few hundred resources. However, lightweight threading models implemented by libraries/frameworks such as Akka [3], Quasar [4], and Erjang [5] can scale to millions of "threads" on reasonably sized servers and provide the supervisor/actor/mailbox abstractions described above. Most importantly, this approach does not require operators to become operationally knowledgeable of yet another platform/component. In short, I believe we can encapsulate these requirements in the management server (orchestration + automation control tiers) - keeping the operational footprint of the system proportional to the deployment without sacrificing resilience. Finally, it provides the foundation for proper collection of instrumentation information and process control/monitoring across data centers.
> >
> > Admittedly, I have hand waved some significant issues that would need to be resolved. I believe they are all resolvable, but it would take discussion to determine the best approach to them. Transforming CloudStack to such a model would not be trivial, but I believe it would be worth the (significant) effort, as it would make CloudStack one of the most scalable and resilient cloud orchestration/management platforms available.
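Tying that back to the rapid create-then-destroy case (and to Darren's original question quoted at the bottom), a toy driver over the hypothetical sketches above shows why the two requests cannot race: both units of work land in the same resource's mailbox and are applied strictly in order, so the create completes before the destroy begins.

    // Toy driver reusing the invented sketches above: two units of work for the same
    // resource share one mailbox, so the create fully executes before the destroy starts.
    public class CreateThenDestroyExample {

        public static void main(String[] args) {
            PartitionSupervisor supervisor = new PartitionSupervisor();
            supervisor.onResourceDefinition("host-42");

            supervisor.submit(work("host-42", "create VM 1"));
            supervisor.submit(work("host-42", "destroy VM 1"));
            // Output order is deterministic: "create VM 1 ..." then "destroy VM 1 ...",
            // never the interleaving that leaves dangling resources behind. (The resource
            // thread keeps waiting for more work; lifecycle management is elided here.)
        }

        static TierBoundary.UnitOfWork work(String resourceId, String action) {
            return new TierBoundary.UnitOfWork() {
                @Override public String resourceId() { return resourceId; }
                @Override public void applyTo(Object device) {
                    System.out.println(action + " applied on " + device);
                }
            };
        }
    }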
> >
> > Thanks,
> > -John
> >
> > [1]: http://www.slideshare.net/JohnBurwell1/how-to-run-from-a-zombie-cloud-stack-distributed-process-management
> > [2]: http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
> > [3]: http://akka.io
> > [4]: https://github.com/puniverse/quasar
> > [5]: https://github.com/trifork/erjang/wiki
> >
> > P.S. I have CC'ed the developer mailing list. All conversations at this level of detail should be initiated and occur on the mailing list to ensure transparency with the community.
> >
> > On Nov 22, 2013, at 3:49 PM, Darren Shepherd wrote:
> >
> > 140 characters are not productive.
> >
> > What would be your ideal way to do distributed concurrency control? Simple use case: Server 1 receives a request to start VM 1. Server 2 receives a request to delete VM 1. What do you do?
> >
> > Darren