cloudstack-dev mailing list archives

From John Burwell <jburw...@basho.com>
Subject Re: [DISCUSS/PROPOSAL] Upgrading Driver Model
Date Wed, 21 Aug 2013 07:00:15 GMT
Daan,

I have the following issues with OSGi: 

Complexity:  Building OSGi components adds a tremendous amount of complexity to both building
drivers and debugging runtime issues.  Additionally, OSGi has a much broader feature set than
I think CloudStack needs to support.  Therefore, driver authors may use the feature set in
unanticipated ways that create system instability.
Dependency Hell: OSGi requires 3rd party dependencies to be packaged as OSGi bundles.  In
practice, many third party libraries either have issues that prevent them from being bundled
or their OSGi bundled versions lag behind the mainline release.

As an additional note from personal experience, I do not want to re-create the mess that is Eclipse
(i.e. an erector set with more screws than nuts).  In addition to its lack of reliability,
it is incredibly difficult to comprehend how the component configurations and relationships
are composed at runtime.

To be clear, I am not interested in creating a general purpose component/plugin model.  Fundamentally,
we need a simple, purpose-built component model focused on providing stability and reliability
through deterministic behavior rather than feature flexibility.  Unfortunately, both OSGi
and Spring focus on the latter, which makes them ill-suited for our purposes.

Thanks,
-John

On Aug 21, 2013, at 2:31 AM, Daan Hoogland <daan.hoogland@gmail.com> wrote:

> John,
> 
> Nice work.
> Given the maturity of OSGi, I'd say let's see how it fits. One criterion
> would be whether we can limit the bundles that may be loaded based on
> what CloudStack supports (and not allow loading pydev). If not, we need
> to bake our own.
> 
> But though I think your work is valuable, I disagree with designing our
> CARs from the get-go without first having explored usable options in
> the field. A new type of YARs is not what the world or CloudStack
> needs. And given what you have written, the main problem will be finding
> a framework we can restrict to what we want, not one that can do all
> of it.
> 
> done shooting,
> Daan
> 
> On Wed, Aug 21, 2013 at 2:52 AM, Darren Shepherd
> <darren.s.shepherd@gmail.com> wrote:
>> Sure, I fully understand how it theoretically works, but I'm saying from a
>> practical perspective it always seems to fall apart.  What you're describing
>> is done excellently in OSGi 4.2 Blueprint.  It's a beautiful framework that
>> allows you to expose services that can be dynamically updated at runtime.
>> 
>> The issues always happen with unloading.  I'll give you a real world
>> example.  As part of the servlet spec you're supposed to be able to stop and
>> unload wars.  But in practice if you do it enough times you typically run
>> out of memory.  One such issue was with commons logging (since fixed).
>> When you do getLogger(MyClass.class) it would cache a reference from the
>> Class object to the actual log impl.  The commons logging jar is typically
>> loaded by a system classloader, but MyClass.class would be loaded by the
>> webapp classloader.  So when you stop the war there is a reference chain
>> system classloader -> logfactory -> MyClass -> webapp classloader.  So the
>> web app never gets GC'd.
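
The reference chain described above can be sketched in plain Java. This is a simplified illustration and not the actual commons-logging code: `LeakyLogFactory`, `MyClass`, and `release` are hypothetical names, and for brevity both classes are loaded by the same classloader here, whereas in a servlet container `MyClass` would come from the webapp classloader.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a logging factory loaded by the system classloader.
final class LeakyLogFactory {
    // Static cache keyed by Class. Each key strongly references its defining
    // classloader, so the loader stays reachable for as long as the entry exists.
    private static final Map<Class<?>, String> LOGGERS = new HashMap<>();

    static String getLogger(Class<?> owner) {
        return LOGGERS.computeIfAbsent(owner, c -> "logger for " + c.getName());
    }

    static int cachedClasses() {
        return LOGGERS.size();
    }

    // One possible remedy (an assumption, not a documented API): drop every
    // entry whose class was defined by the loader being undeployed.
    static void release(ClassLoader loader) {
        LOGGERS.keySet().removeIf(c -> c.getClassLoader() == loader);
    }
}

// Stand-in for a class that would live inside the web application.
final class MyClass {}

final class LeakDemo {
    public static void main(String[] args) {
        LeakyLogFactory.getLogger(MyClass.class);
        System.out.println("cached before release: " + LeakyLogFactory.cachedClasses());
        LeakyLogFactory.release(MyClass.class.getClassLoader());
        System.out.println("cached after release: " + LeakyLogFactory.cachedClasses());
    }
}
```

Until something like `release` runs, the static map keeps the class, and through it the classloader and everything that loader defined, reachable from a GC root.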
>> 
>> So just pointing out the practical issues, that's it.
>> 
>> Darren
>> 
>> On Aug 20, 2013, at 5:31 PM, John Burwell <jburwell@basho.com> wrote:
>> 
>> Darren,
>> 
>> Actually, loading and unloading aren't difficult if resource management and
>> drivers work within the following constraints/assumptions:
>> 
>> - Drivers are transient and stateless
>> - A driver instance is assigned per resource managed (i.e. no singletons)
>> - A lightweight thread and mailbox (i.e. actor model) are assigned per
>> resource managed (outlined in the presentation referenced below)
>> 
>> 
>> Based on these constraints and assumptions, the following upgrade process
>> could be implemented:
>> 
>> 1. Load and verify new driver version to make it available
>> 2. Notify the supervisor processes of each affected resource that a new
>> driver is available
>> 3. Upon completion of the current message being processed by its associated
>> actor, the supervisor kills and respawns the actor managing its associated
>> resource
>> 4. As part of startup, the supervisor injects an instance of the new driver
>> version and the actor resumes processing messages in its mailbox
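
A minimal sketch of the upgrade process above, under several assumptions: `Driver`, `ResourceActor`, and `upgrade` are hypothetical names, a single-threaded executor stands in for the lightweight thread and mailbox, and the driver swap is queued as an ordinary message instead of having a supervisor kill and respawn the actor. The property it tries to show is the same: the new driver only takes effect between messages, never in the middle of one.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical driver contract: stateless, one instance per managed resource.
interface Driver {
    String handle(String message);
}

// One actor per resource. The single-threaded executor doubles as the mailbox,
// so messages are processed one at a time in submission order.
final class ResourceActor {
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
    private final List<String> results = new CopyOnWriteArrayList<>();
    private Driver driver; // read and written only on the mailbox thread

    ResourceActor(Driver initial) {
        mailbox.execute(() -> driver = initial);
    }

    void send(String message) {
        mailbox.execute(() -> results.add(driver.handle(message)));
    }

    // The swap is queued like any other message, so it runs only after the
    // message currently being processed has completed.
    void upgrade(Driver next) {
        mailbox.execute(() -> driver = next);
    }

    // Stops accepting messages, waits for the mailbox to empty, returns results.
    List<String> drain() {
        mailbox.shutdown();
        try {
            mailbox.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }
}

final class UpgradeDemo {
    public static void main(String[] args) {
        ResourceActor actor = new ResourceActor(m -> "v1:" + m);
        actor.send("attach-volume");
        actor.upgrade(m -> "v2:" + m);
        actor.send("detach-volume");
        System.out.println(actor.drain());
    }
}
```

Messages sent before `upgrade` are handled by the old driver and messages sent after it by the new one, without the process restarting. Unloading the old driver's classes is a separate concern that this sketch does not address.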
>> 
>> 
>> This process mirrors the process that would occur on management server
>> startup for each resource minus killing an existing actor instance.
>> Eventually, the system will upgrade the driver without loss of operation.
>> More sophisticated policies could be added, but I think this approach would
>> be a solid default upgrade behavior.  As a bonus, this same approach could
>> also be applied to global configuration settings -- allowing the system to
>> apply changes to these values without restarting the system.
>> 
>> In summary, CloudStack and Eclipse are very different types of systems.
>> Eclipse is a desktop application implementing complex workflows, user
>> interactions, and management of shared state (e.g. project structure, AST,
>> compiler status, etc).  In contrast, CloudStack is an eventually consistent
>> distributed system performing automation control.  As such, its plugin
>> requirements are not only very different, but IMHO, much simpler.
>> 
>> Thanks,
>> -John
>> 
>> On Aug 20, 2013, at 7:44 PM, Darren Shepherd <darren.s.shepherd@gmail.com>
>> wrote:
>> 
>> I know this isn't terribly useful, but I've been drawing a lot of squares
>> and circles and lines that connect those squares and circles lately and I
>> have a lot of architectural ideas for CloudStack.  At the rate I'm going it
>> will take me about two weeks to put together a discussion/proposal for the
>> community.  What I'm thinking is a superset of what you've listed out and
>> should align with your idea of a CAR.  The focus has a lot to do with
>> modularity and extensibility.
>> 
>> So more to come soon....  I will say one thing though: with Java you end
>> up having a hard time dynamically loading and unloading modules.  There
>> are plenty of frameworks that try really hard to do this right, like OSGi,
>> but it's darn near impossible to do it right because of class loading and GC
>> issues (and that's why Eclipse has you restart after installing plugins even
>> though it is OSGi).
>> 
>> I do believe that CloudStack should be capable of zero downtime maintenance
>> and have ideas around that, but at the end of the day, for plenty of
>> practical reasons, you still need a JVM restart if modules change.
>> 
>> Darren
>> 
>> On Aug 20, 2013, at 3:39 PM, Mike Tutkowski <mike.tutkowski@solidfire.com>
>> wrote:
>> 
>> I agree, John - let's get consensus first, then talk time tables.
>> 
>> 
>> On Tue, Aug 20, 2013 at 4:31 PM, John Burwell <jburwell@basho.com> wrote:
>> 
>> Mike,
>> 
>> Before we can dig into timelines or implementations, I think we need to
>> get consensus on the problem to be solved and the goals.  Once we have a
>> proper understanding of the scope, I believe we can chunk the work across
>> a set of development lifecycles.  The subject is vast, but it also has a
>> far reaching impact on both the storage and network layer evolution
>> efforts.  As such, I believe we need to start addressing it as part of the
>> next release.
>> 
>> As a separate thread, we need to discuss the timeline for the next
>> release.  I think we need to avoid the time compression caused by the
>> overlap of the 4.1 stabilization effort and 4.2 development.  Therefore, I
>> don't think we should consider development of the next release started
>> until the first 4.2 RC is released.  I will try to open a separate discuss
>> thread for this topic, as well as tying off the discussion of release code
>> names.
>> 
>> Thanks,
>> -John
>> 
>> On Aug 20, 2013, at 6:22 PM, Mike Tutkowski <mike.tutkowski@solidfire.com>
>> wrote:
>> 
>> Hey John,
>> 
>> I think this is some great stuff. Thanks for the write up.
>> 
>> It looks like you have ideas around what might go into a first release of
>> this plug-in framework. Were you thinking we'd have enough time to squeeze
>> that first rev into 4.3? I'm just wondering (it's not a huge deal to hit
>> that release for this) because we would only have about five weeks.
>> 
>> Thanks
>> 
>> 
>> On Tue, Aug 20, 2013 at 3:43 PM, John Burwell <jburwell@basho.com> wrote:
>> 
>> 
>> All,
>> 
>> In capturing my thoughts on storage, my thinking backed into the driver
>> model.  While we have the beginnings of such a model today, I see the
>> following deficiencies:
>> 
>> 
>> 1. *Multiple Models*: The Storage, Hypervisor, and Security layers
>> each have a slightly different model for allowing system functionality
>> to be extended/substituted.  These differences raise the barrier to
>> entry for vendors seeking to extend CloudStack and accrete code paths
>> to be maintained and verified.
>> 2. *Leaky Abstraction*:  Plugins are registered through a Spring
>> configuration file.  In addition to being operator unfriendly (most
>> sysadmins are not Spring experts nor do they want to be), we expose
>> the core bootstrapping mechanism to operators.  Therefore, a
>> misconfiguration could negatively impact the injection/configuration
>> of internal management server components.  We are essentially handing
>> them a loaded shotgun pointed at our right foot.
>> 3. *Nondeterministic Load/Unload Model*:  Because the core loading
>> mechanism is Spring, the management server has little control over the
>> timing and order of component loading/unloading.  Changes to the
>> Management Server's component dependency graph could break a driver by
>> causing it to be started at an unexpected time.
>> 4. *Lack of Execution Isolation*: As Spring components, plugins are
>> loaded into the same execution context as core management server
>> components.  Therefore, an errant plugin can corrupt the entire
>> management server.
>> 
>> 
>> For the next revision of the plugin/driver mechanism, I would like to see
>> us migrate towards a standard pluggable driver model that supports all of
>> the management server's extension points (e.g. network devices, storage
>> devices, hypervisors, etc) with the following capabilities:
>> 
>> 
>> - *Consolidated Lifecycle and Startup Procedure*:  Drivers share a
>> common state machine and categorization (e.g. network, storage,
>> hypervisor, etc) that permits the deterministic calculation of
>> initialization and destruction order (i.e. network layer drivers ->
>> storage layer drivers -> hypervisor drivers).  Plugin inter-dependencies
>> would be supported between plugins sharing the same category.
>> - *In-process Installation and Upgrade*: Adding or upgrading a driver
>> does not require the management server to be restarted.  This
>> capability implies a system that supports the simultaneous execution of
>> multiple driver versions and the ability to suspend work on a resource
>> while the underlying driver instance is replaced.
>> - *Execution Isolation*: The deployment packaging and execution
>> environment allow different (and potentially conflicting) versions of
>> dependencies to be used simultaneously.  Additionally, plugins would be
>> sufficiently sandboxed to protect the management server against driver
>> instability.
>> - *Extension Data Model*: Drivers provide a property bag with a
>> metadata descriptor to validate and render vendor specific data.  The
>> contents of this property bag will be provided to every driver operation
>> invocation at runtime.  The metadata descriptor would be a lightweight
>> description that provides a label resource key, a description resource
>> key, data type (string, date, number, boolean), required flag, and
>> optional length limit.
>> - *Introspection*: Administrative APIs/UIs allow operators to
>> understand which drivers are in the system, their configuration, and
>> their current state.
>> - *Discoverability*: Optionally, drivers can be discovered via a
>> project repository definition (similar to Yum) allowing drivers to be
>> remotely acquired and operators to be notified regarding update
>> availability.  The project would also provide, free of charge,
>> certificates to sign plugins.  This mechanism would support local
>> mirroring to support air gapped management networks.
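
The Extension Data Model capability in the list above could be sketched roughly as follows. Everything here is an assumption made for illustration: the names `PropertyDescriptor`, `DataType`, and `PropertyBagValidator` are hypothetical, property values are modeled as strings, dates are assumed to be ISO-8601 (`yyyy-MM-dd`), and the error messages are placeholders for what would presumably be resource keys.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

enum DataType { STRING, DATE, NUMBER, BOOLEAN }

// Hypothetical metadata descriptor for one vendor-specific property.
final class PropertyDescriptor {
    final String name;
    final String labelKey;       // resource key for the UI label
    final String descriptionKey; // resource key for the UI description
    final DataType type;
    final boolean required;
    final Integer maxLength;     // null means no length limit

    PropertyDescriptor(String name, String labelKey, String descriptionKey,
                       DataType type, boolean required, Integer maxLength) {
        this.name = name;
        this.labelKey = labelKey;
        this.descriptionKey = descriptionKey;
        this.type = type;
        this.required = required;
        this.maxLength = maxLength;
    }
}

final class PropertyBagValidator {
    // Returns one message per violation; an empty list means the bag is valid.
    static List<String> validate(List<PropertyDescriptor> descriptors, Map<String, String> bag) {
        List<String> errors = new ArrayList<>();
        for (PropertyDescriptor d : descriptors) {
            String value = bag.get(d.name);
            if (value == null) {
                if (d.required) {
                    errors.add(d.name + ": required");
                }
                continue;
            }
            if (d.maxLength != null && value.length() > d.maxLength) {
                errors.add(d.name + ": too long");
            }
            if (!matchesType(d.type, value)) {
                errors.add(d.name + ": not a " + d.type);
            }
        }
        return errors;
    }

    private static boolean matchesType(DataType type, String value) {
        switch (type) {
            case NUMBER:
                try { new BigDecimal(value); return true; }
                catch (NumberFormatException e) { return false; }
            case BOOLEAN:
                return value.equals("true") || value.equals("false");
            case DATE:
                try { LocalDate.parse(value); return true; }
                catch (DateTimeParseException e) { return false; }
            default:
                return true; // STRING accepts any value
        }
    }
}
```

Because the descriptor is plain data, the same list could drive both server-side validation and UI rendering, which seems to be the intent of "validate and render" in the proposal.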
>> 
>> 
>> Fundamentally, I do not want to turn CloudStack into an erector set with
>> more screws than nuts, which is a risk with highly pluggable
>> architectures.  As such, I think we would need to tightly bound the scope
>> of drivers and their behaviors to prevent the loss of system usability and
>> stability.  My thinking is that drivers would be packaged into a custom
>> JAR, CAR (CloudStack ARchive), that would be structured as follows:
>> 
>> 
>> - META-INF
>>   - MANIFEST.MF
>>   - driver.yaml (driver metadata, e.g. version, name, description,
>>   etc., serialized in YAML format)
>>   - LICENSE (a text file containing the driver's license)
>> - lib (driver dependencies)
>> - classes (driver implementation)
>> - resources (driver message files and potentially JS resources)
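
Since a CAR as outlined above is a JAR with a fixed layout, a loader could reject malformed archives before doing anything else. The sketch below assumes that the three META-INF entries are mandatory, which the proposal does not state; `CarLayout`, `missingEntries`, and `archiveOf` are hypothetical names, and `archiveOf` exists only to build a throwaway archive in memory for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class CarLayout {
    // Assumption: these entries are treated as mandatory for every CAR.
    static final List<String> REQUIRED = List.of(
            "META-INF/MANIFEST.MF", "META-INF/driver.yaml", "META-INF/LICENSE");

    // Returns the required entries that the archive does not contain.
    static Set<String> missingEntries(byte[] carBytes) {
        Set<String> missing = new TreeSet<>(REQUIRED);
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(carBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                missing.remove(e.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return missing;
    }

    // Demo helper: builds an in-memory archive containing empty entries.
    static byte[] archiveOf(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] car = archiveOf("META-INF/MANIFEST.MF", "META-INF/driver.yaml", "classes/Foo.class");
        System.out.println("missing: " + missingEntries(car));
    }
}
```

A fuller check would also parse driver.yaml and verify the signature mentioned under Discoverability; both are left out here.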
>> 
>> 
>> The management server would acquire drivers through a simple scan of a
>> URL (e.g. file directory, S3 bucket, etc).  For every CAR object found,
>> the management server would create an execution environment (likely a
>> dedicated ExecutorService and Classloader), and transition the state of
>> the driver to Running (the exact state model would need to be worked
>> out).  To be really nice, we could develop a custom Ant task/Maven
>> plugin/Gradle plugin to create CARs.   I can also imagine opportunities
>> to add hooks to this model to register instrumentation information with
>> JMX and authorization.
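
The scan-and-start step described above might look like the following for the local directory case. It is a sketch under stated assumptions: `DriverEnvironment`, `DriverScanner`, and the three-value `DriverState` are hypothetical (the message says the state model is still to be worked out), archives are recognized by a `.car` suffix, and a plain `URLClassLoader` is used even though it would not by itself resolve the `classes` and `lib` directories inside a CAR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Placeholder state model; the real one is left open in the proposal.
enum DriverState { LOADED, RUNNING, STOPPED }

// One isolated execution environment per CAR: its own classloader and executor.
final class DriverEnvironment implements AutoCloseable {
    final Path archive;
    final URLClassLoader classLoader;
    final ExecutorService executor = Executors.newSingleThreadExecutor();
    private DriverState state = DriverState.LOADED;

    DriverEnvironment(Path archive) {
        this.archive = archive;
        this.classLoader = new URLClassLoader(
                new URL[] { toUrl(archive) }, DriverEnvironment.class.getClassLoader());
    }

    private static URL toUrl(Path path) {
        try {
            return path.toUri().toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    void start() { state = DriverState.RUNNING; }

    DriverState state() { return state; }

    @Override
    public void close() {
        state = DriverState.STOPPED;
        executor.shutdownNow();
        try {
            classLoader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class DriverScanner {
    // Creates and starts one environment for every *.car file in the directory.
    static List<DriverEnvironment> scan(Path directory) {
        List<DriverEnvironment> environments = new ArrayList<>();
        try (DirectoryStream<Path> cars = Files.newDirectoryStream(directory, "*.car")) {
            for (Path car : cars) {
                DriverEnvironment env = new DriverEnvironment(car);
                env.start();
                environments.add(env);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return environments;
    }
}
```

Giving each driver its own classloader is what would allow conflicting dependency versions to coexist, and closing the environment releases both the loader and the executor in one place. Iteration order of the directory stream is unspecified, so deterministic startup ordering would need an explicit sort by category.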
>> 
>> 
>> To keep the scope of this email confined, we would introduce the general
>> notion of a Resource, and (hand wave hand wave) eventually
>> compartmentalize the execution of work around a resource [1].  This
>> (hand waved) compartmentalization would allow us the controls necessary
>> to safely and reliably perform in-place driver upgrades.  For an initial
>> release, I would recommend implementing the abstractions, loading
>> mechanism, extension data model, and discovery features.  With these
>> capabilities in place, we could attack the in-place upgrade model.
>> 
>> If we were to adopt such a pluggable capability, we would have the
>> opportunity to decouple the vendor and CloudStack release schedules.
>> For example, if a vendor were introducing a new product that required a
>> new or updated driver, they would no longer need to wait for a CloudStack
>> release to support it.  They would also gain the ability to fix high
>> priority defects in the same manner.
>> 
>> I have hand waved a number of issues that would need to be resolved
>> before such an approach could be implemented.  However, I think we need
>> to decide, as a community, that it is worth devoting energy and effort to
>> enhancing the plugin/driver model, and agree on the goals of that effort,
>> before diving head first into the deep rabbit hole of
>> design/implementation.
>> 
>> Thoughts? (/me ducks)
>> -John
>> 
>> [1]: My opinions on the matter from CloudStack Collab 2013 ->
>> 
>> http://www.slideshare.net/JohnBurwell1/how-to-run-from-a-zombie-cloud-stack-distributed-process-management
>> 
>> 
>> 
>> 
>> --
>> *Mike Tutkowski*
>> *Senior CloudStack Developer, SolidFire Inc.*
>> e: mike.tutkowski@solidfire.com
>> o: 303.746.7302
>> Advancing the way the world uses the
>> cloud<http://solidfire.com/solution/overview/?video=play>
>> *™*
>> 
>> 
>> 
>> 
>> 
