cloudstack-dev mailing list archives

From John Burwell <>
Subject Re: [DISCUSS/PROPOSAL] Upgrading Driver Model
Date Wed, 21 Aug 2013 00:46:42 GMT

My response does hand wave two important issues -- hot code reloading and PermGen leakage.
 These are tricky but well-trodden issues that can be solved in a variety of ways (e.g. instrumentation,
class loaders, OSGi).  It would require some research/experimentation to determine the best
approach, particularly when using a lightweight threading model.


On Aug 20, 2013, at 8:31 PM, John Burwell <> wrote:

> Darren,
> Actually, loading and unloading aren't difficult if resource management and drivers work
within the following constraints/assumptions:
> Drivers are transient and stateless
> A driver instance is assigned per resource managed (i.e. no singletons)
> A lightweight thread and mailbox (i.e. actor model) are assigned per resource managed
(outlined in the presentation referenced below)
> Based on these constraints and assumptions, the following upgrade process could be implemented:
> Load and verify new driver version to make it available
> Notify the supervisor processes of each affected resource that a new driver is available
> Upon completion of the current message being processed by its associated actor, the supervisor
kills and respawns the actor managing its associated resource 
> As part of startup, the supervisor injects an instance of the new driver version and
the actor resumes processing messages in its mailbox
> This process mirrors the process that would occur on management server startup for each
resource minus killing an existing actor instance.  Eventually, the system will upgrade the
driver without loss of operation.  More sophisticated policies could be added, but I think
this approach would be a solid default upgrade behavior.  As a bonus, this same approach could
also be applied to global configuration settings -- allowing the system to apply changes to
these values without restarting the system.
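
The upgrade steps above can be sketched roughly as follows. This is a minimal, single-threaded simulation rather than a real actor runtime: the names `Driver`, `ResourceActor`, `ResourceSupervisor`, and `DriverSwapDemo` are hypothetical and do not exist in CloudStack, and a real implementation would run each actor on its own lightweight thread. The point it illustrates is that the supervisor only swaps the driver between messages, so the mailbox survives the respawn.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical driver contract: stateless, one instance per managed resource.
interface Driver {
    String version();
    String handle(String message);
}

// The actor only wraps a driver instance; it holds no state of its own.
final class ResourceActor {
    private final Driver driver;
    ResourceActor(Driver driver) { this.driver = driver; }
    String process(String message) { return driver.handle(message); }
}

// The supervisor owns the mailbox, so killing and respawning the actor loses no messages.
final class ResourceSupervisor {
    private final Queue<String> mailbox = new ArrayDeque<>();
    private final List<String> results = new ArrayList<>();
    private ResourceActor actor;
    private Driver pendingDriver; // set when a new driver version is announced

    ResourceSupervisor(Driver initial) { this.actor = new ResourceActor(initial); }

    void send(String message) { mailbox.add(message); }

    // Notification that a new, already verified driver version is available.
    void driverAvailable(Driver newDriver) { pendingDriver = newDriver; }

    // Processes one message. Any pending upgrade is applied first, i.e. only
    // after the previous message has completed. Returns false when the mailbox is empty.
    boolean step() {
        if (pendingDriver != null) {
            actor = new ResourceActor(pendingDriver); // "kill and respawn" with the new driver
            pendingDriver = null;
        }
        String message = mailbox.poll();
        if (message == null) {
            return false;
        }
        results.add(actor.process(message));
        return true;
    }

    List<String> results() { return results; }
}

final class DriverSwapDemo {
    // Builds a trivial driver that tags each message with its version.
    static Driver driver(String version) {
        return new Driver() {
            public String version() { return version; }
            public String handle(String message) { return version + ":" + message; }
        };
    }

    public static void main(String[] args) {
        ResourceSupervisor supervisor = new ResourceSupervisor(driver("v1"));
        supervisor.send("attach");
        supervisor.send("snapshot");
        supervisor.step();                        // handled by v1
        supervisor.driverAvailable(driver("v2")); // upgrade announced mid-stream
        while (supervisor.step()) { }             // remaining messages handled by v2
        System.out.println(supervisor.results()); // prints [v1:attach, v2:snapshot]
    }
}
```

Because the driver is stateless and the mailbox lives with the supervisor, the swap needs no coordination beyond waiting for the in-flight message to finish.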
> In summary, CloudStack and Eclipse are very different types of systems.  Eclipse is a
desktop application implementing complex workflows, user interactions, and management of shared
state (e.g. project structure, AST, compiler status, etc).  In contrast, CloudStack is an
eventually consistent distributed system performing automation control.  As such, its plugin
requirements are not only very different, but IMHO, much simpler.
> Thanks,
> -John
> On Aug 20, 2013, at 7:44 PM, Darren Shepherd <> wrote:
>> I know this isn't terribly useful, but I've been drawing a lot of squares and circles
and lines that connect those squares and circles lately and I have a lot of architectural
ideas for CloudStack.  At the rate I'm going it will take me about two weeks to put together
a discussion/proposal for the community.  What I'm thinking is a superset of what you've listed
out and should align with your idea of a CAR.  The focus has a lot to do with modularity
and extensibility.  
>> So more to come soon....  I will say one thing though: with Java you end up having
a hard time doing dynamic loading and unloading of modules.  There are plenty of frameworks that
try really hard to do this right, like OSGi, but it's darn near impossible to do it right because
of class loading and GC issues (and that's why Eclipse has you restart after installing plugins
even though it is OSGi).
>> I do believe that CloudStack should be capable of zero downtime maintenance and
have ideas around that, but at the end of the day, for plenty of practical reasons, you still
need a JVM restart if modules change.   
>> Darren
>> On Aug 20, 2013, at 3:39 PM, Mike Tutkowski <> wrote:
>>> I agree, John - let's get consensus first, then talk time tables.
>>> On Tue, Aug 20, 2013 at 4:31 PM, John Burwell <> wrote:
>>>> Mike,
>>>> Before we can dig into timelines or implementations, I think we need to
>>>> get consensus on the problem to be solved and the goals.  Once we have a
>>>> proper understanding of the scope, I believe we can chunk the work across a set
>>>> of development lifecycles.  The subject is vast, but it also has a far
>>>> reaching impact on both the storage and network layer evolution efforts.
>>>> As such, I believe we need to start addressing it as part of the next
>>>> release.
>>>> As a separate thread, we need to discuss the timeline for the next
>>>> release.  I think we need to avoid the time compression caused by the
>>>> overlap of the 4.1 stabilization effort and 4.2 development.  Therefore, I
>>>> don't think we should consider development of the next release started
>>>> until the first 4.2 RC is released.  I will try to open a separate discuss
>>>> thread for this topic, as well as tying off the discussion of release code
>>>> names.
>>>> Thanks,
>>>> -John
>>>> On Aug 20, 2013, at 6:22 PM, Mike Tutkowski <>
>>>> wrote:
>>>>> Hey John,
>>>>> I think this is some great stuff. Thanks for the write up.
>>>>> It looks like you have ideas around what might go into a first release of
>>>>> this plug-in framework. Were you thinking we'd have enough time to
>>>>> squeeze that first rev into 4.3? I'm just wondering (it's not a huge deal to
>>>>> hit that release for this) because we would only have about five weeks.
>>>>> Thanks
>>>>> On Tue, Aug 20, 2013 at 3:43 PM, John Burwell <>
>>>> wrote:
>>>>>> All,
>>>>>> In capturing my thoughts on storage, my thinking backed into the driver
>>>>>> model.  While we have the beginnings of such a model today, I see the
>>>>>> following deficiencies:
>>>>>> 1. *Multiple Models*: The Storage, Hypervisor, and Security layers
>>>>>> each have a slightly different model for allowing system functionality to
>>>>>> be extended/substituted.  These differences increase the barrier to entry
>>>>>> for vendors seeking to extend CloudStack and accrete code paths to be
>>>>>> maintained and verified.
>>>>>> 2. *Leaky Abstraction*:  Plugins are registered through a Spring
>>>>>> configuration file.  In addition to being operator unfriendly (most
>>>>>> sysadmins are not Spring experts nor do they want to be), we expose the
>>>>>> core bootstrapping mechanism to operators.  Therefore, a misconfiguration
>>>>>> could negatively impact the injection/configuration of internal management
>>>>>> server components.  Essentially handing them a loaded shotgun pointed at
>>>>>> our right foot.
>>>>>> 3. *Nondeterministic Load/Unload Model*:  Because the core loading
>>>>>> mechanism is Spring, the management server has little control over the
>>>>>> timing and order of component loading/unloading.  Changes to the Management
>>>>>> Server's component dependency graph could break a driver by causing it to
>>>>>> be started at an unexpected time.
>>>>>> 4. *Lack of Execution Isolation*: As Spring components, plugins are
>>>>>> loaded into the same execution context as core management server
>>>>>> components.  Therefore, an errant plugin can corrupt the entire management
>>>>>> server.
>>>>>> For the next revision of the plugin/driver mechanism, I would like to see us
>>>>>> migrate towards a standard pluggable driver model that supports all the
>>>>>> management server's extension points (e.g. network devices, storage
>>>>>> devices, hypervisors, etc) with the following capabilities:
>>>>>> - *Consolidated Lifecycle and Startup Procedure*:  Drivers share a
>>>>>> common state machine and categorization (e.g. network, storage, hypervisor,
>>>>>> etc) that permits the deterministic calculation of initialization and
>>>>>> destruction order (i.e. network layer drivers -> storage layer drivers ->
>>>>>> hypervisor drivers).  Plugin inter-dependencies would be supported between
>>>>>> plugins sharing the same category.
>>>>>> - *In-process Installation and Upgrade*: Adding or upgrading a driver
>>>>>> does not require the management server to be restarted.  This capability
>>>>>> implies a system that supports the simultaneous execution of multiple
>>>>>> driver versions and the ability to suspend continued execution of work on a
>>>>>> resource while the underlying driver instance is replaced.
>>>>>> - *Execution Isolation*: The deployment packaging and execution
>>>>>> environment allow different (and potentially conflicting) versions of
>>>>>> dependencies to be used simultaneously.  Additionally, plugins would be
>>>>>> sufficiently sandboxed to protect the management server against driver
>>>>>> instability.
>>>>>> - *Extension Data Model*: Drivers provide a property bag with a
>>>>>> metadata descriptor to validate and render vendor specific data.  The
>>>>>> contents of this property bag will be provided to every driver operation
>>>>>> invocation at runtime.  The metadata descriptor would be a lightweight
>>>>>> description that provides a label resource key, a description resource key,
>>>>>> data type (string, date, number, boolean), required flag, and optional
>>>>>> length limit.
>>>>>> - *Introspection*: Administrative APIs/UIs allow operators to
>>>>>> understand the configuration of the drivers in the system, their
>>>>>> configuration, and their current state.
>>>>>> - *Discoverability*: Optionally, drivers can be discovered via a
>>>>>> project repository definition (similar to Yum) allowing drivers to be
>>>>>> remotely acquired and operators to be notified regarding update
>>>>>> availability.  The project would also provide, free of charge, certificates
>>>>>> to sign plugins.  This mechanism would support local mirroring to support
>>>>>> air gapped management networks.
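
As an illustration of the Extension Data Model item above, the following is a minimal sketch of a metadata descriptor and a validator for the property bag. The type names (`PropertyType`, `PropertyDescriptor`, `PropertyBagValidator`), the string-valued bag, and the ISO-8601 date format are all assumptions made for this example; the proposal does not specify them.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// The four data types named in the proposal.
enum PropertyType { STRING, DATE, NUMBER, BOOLEAN }

// Hypothetical descriptor for one vendor specific property.
final class PropertyDescriptor {
    final String name;
    final String labelKey;       // resource key for the UI label
    final String descriptionKey; // resource key for the UI description
    final PropertyType type;
    final boolean required;
    final Integer maxLength;     // null means no length limit

    PropertyDescriptor(String name, String labelKey, String descriptionKey,
                       PropertyType type, boolean required, Integer maxLength) {
        this.name = name;
        this.labelKey = labelKey;
        this.descriptionKey = descriptionKey;
        this.type = type;
        this.required = required;
        this.maxLength = maxLength;
    }
}

final class PropertyBagValidator {
    // Returns one error string per violation, in descriptor order; empty if the bag is valid.
    static List<String> validate(List<PropertyDescriptor> descriptors, Map<String, String> bag) {
        List<String> errors = new ArrayList<>();
        for (PropertyDescriptor d : descriptors) {
            String value = bag.get(d.name);
            if (value == null || value.isEmpty()) {
                if (d.required) {
                    errors.add(d.name + ": required value is missing");
                }
                continue;
            }
            if (d.maxLength != null && value.length() > d.maxLength) {
                errors.add(d.name + ": longer than " + d.maxLength + " characters");
            }
            if (!matchesType(d.type, value)) {
                errors.add(d.name + ": not a valid " + d.type);
            }
        }
        return errors;
    }

    // Assumption: dates are ISO-8601 (yyyy-MM-dd) and numbers are decimal literals.
    static boolean matchesType(PropertyType type, String value) {
        switch (type) {
            case NUMBER:
                try { new BigDecimal(value); return true; }
                catch (NumberFormatException e) { return false; }
            case DATE:
                try { LocalDate.parse(value); return true; }
                catch (DateTimeParseException e) { return false; }
            case BOOLEAN:
                return value.equalsIgnoreCase("true") || value.equalsIgnoreCase("false");
            default:
                return true; // any non-empty text is a valid STRING
        }
    }
}
```

The same descriptor list could drive both validation and UI rendering, which is presumably why the proposal keeps it to a handful of lightweight fields.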
>>>>>> Fundamentally, I do not want to turn CloudStack into an erector set with
>>>>>> more screws than nuts, which is a risk with highly pluggable architectures.
>>>>>> As such, I think we would need to tightly bound the scope of drivers and
>>>>>> their behaviors to prevent the loss of system usability and stability.  My
>>>>>> thinking is that drivers would be packaged into a custom JAR, CAR
>>>>>> (CloudStack ARchive), that would be structured as follows:
>>>>>> - META-INF
>>>>>>    - MANIFEST.MF
>>>>>>    - driver.yaml (driver metadata (e.g. version, name, description,
>>>>>>    etc) serialized in YAML format)
>>>>>>    - LICENSE (a text file containing the driver's license)
>>>>>> - lib (driver dependencies)
>>>>>> - classes (driver implementation)
>>>>>> - resources (driver message files and potentially JS resources)
>>>>>> The management server would acquire drivers through a simple scan of a URL
>>>>>> (e.g. file directory, S3 bucket, etc).  For every CAR object found, the
>>>>>> management server would create an execution environment (likely a dedicated
>>>>>> ExecutorService and Classloader), and transition the state of the driver to
>>>>>> Running (the exact state model would need to be worked out).  To be really
>>>>>> nice, we could develop a custom Ant task/Maven plugin/Gradle plugin to
>>>>>> create CARs.   I can also imagine opportunities to add hooks to this
>>>>>> model to register instrumentation information with JMX and authorization.
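
The scan-and-load step described above might look something like the sketch below, assuming a local directory as the scan location. `DriverState`, `DriverEnvironment`, and `CarScanner` are hypothetical names, and the three-state enum is only a placeholder for the state model that still needs to be worked out. A plain `URLClassLoader` pointed at the archive would not see the `lib` and `classes` subdirectories of the proposed CAR layout, so a real loader would need custom handling for that structure.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Placeholder state model; the proposal leaves the exact states open.
enum DriverState { LOADED, RUNNING, STOPPED }

// One isolated execution environment per CAR: its own class loader and executor.
final class DriverEnvironment implements AutoCloseable {
    final Path archive;
    final URLClassLoader classLoader;
    final ExecutorService executor;
    DriverState state = DriverState.LOADED;

    DriverEnvironment(Path archive) throws IOException {
        this.archive = archive;
        // Assumption: the archive root is used as the class path entry.
        this.classLoader = new URLClassLoader(
                new URL[] { archive.toUri().toURL() },
                DriverEnvironment.class.getClassLoader());
        this.executor = Executors.newSingleThreadExecutor();
    }

    void start() { state = DriverState.RUNNING; }

    // Releases the executor and class loader so the driver's classes can be collected.
    @Override
    public void close() throws IOException {
        executor.shutdown();
        classLoader.close();
        state = DriverState.STOPPED;
    }
}

final class CarScanner {
    // Creates and starts an environment for every *.car file directly inside the directory.
    static List<DriverEnvironment> scan(Path directory) throws IOException {
        List<DriverEnvironment> environments = new ArrayList<>();
        try (DirectoryStream<Path> cars = Files.newDirectoryStream(directory, "*.car")) {
            for (Path car : cars) {
                DriverEnvironment env = new DriverEnvironment(car);
                env.start();
                environments.add(env);
            }
        }
        return environments;
    }
}
```

Closing the class loader and dropping every reference to it is also the precondition for the driver's classes to be unloaded, which is where the PermGen concerns raised elsewhere in this thread come in.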
>>>>>> To keep the scope of this email confined, we would introduce the
>>>>>> notion of a Resource, and (hand wave hand wave) eventually compartmentalize
>>>>>> the execution of work around a resource [1].  This (hand waved)
>>>>>> compartmentalization would allow us the controls necessary to safely and
>>>>>> reliably perform in-place driver upgrades.  For an initial release, I would
>>>>>> recommend implementing the abstractions, loading mechanism, extension data
>>>>>> model, and discovery features.  With these capabilities in place, we could
>>>>>> attack the in-place upgrade model.
>>>>>> If we were to adopt such a pluggable capability, we would have the
>>>>>> opportunity to decouple the vendor and CloudStack release schedules.  For
>>>>>> example, if a vendor were introducing a new product that required a new or
>>>>>> updated driver, they would no longer need to wait for a CloudStack release
>>>>>> to support it.  They would also gain the ability to fix high priority
>>>>>> defects in the same manner.
>>>>>> I have hand waved a number of issues that would need to be resolved before
>>>>>> such an approach could be implemented.  However, I think we need to decide,
>>>>>> as a community, that it is worth devoting energy and effort to enhancing the
>>>>>> plugin/driver model and the goals of that effort before diving head first
>>>>>> into the deep rabbit hole of design/implementation.
>>>>>> Thoughts? (/me ducks)
>>>>>> -John
>>>>>> [1]: My opinions on the matter from CloudStack Collab 2013 ->
>>>>> --
>>>>> *Mike Tutkowski*
>>>>> *Senior CloudStack Developer, SolidFire Inc.*
>>>>> e:
>>>>> o: 303.746.7302
>>>>> Advancing the way the world uses the
>>>>> cloud<>
>>>>> *™*
