stratos-dev mailing list archives

From "Martin Eppel (meppel)" <>
Subject RE: Maintenance modes (was RE: [jira] [Commented] (STRATOS-1234) Software Update Management Solution for Stratos)
Date Thu, 07 May 2015 16:14:10 GMT

I’ll schedule a WebEx call early next week and add the names on the current email list.
Please let me know who else should be on it.



Adding David Spence

From: Lakmal Warusawithana []
Sent: Thursday, May 07, 2015 6:45 AM
Cc: Sandaruwan Nanayakkara (JIRA); Imesh Gunaratne; Shaheedur Haque (shahhaqu);
Ryan Du Plessis (rdupless)
Subject: Re: Maintenance modes (was RE: [jira] [Commented] (STRATOS-1234) Software Update
Management Solution for Stratos)

+1, shall we have a call sometime next week?

On Thursday, May 7, 2015, Martin Eppel (meppel) wrote:
Hi Imesh, Sandaruwan

We would like to continue the discussion on this feature as we think this could be a useful
enhancement to stratos.

To get some idea of the effort, and as a first step towards an implementation, I identified
the areas / components which IMHO need to be enhanced (based on Stratos 4.1).
Btw, I also marked some of the items with a “?” - any feedback would be appreciated.

•        new REST API to update resource state with maintenance mode:

o   PUT, resource types: application / group / cluster / instance

o   Maintenance mode on / off / restart / replace

•  sub state: autoscaling off / on

•  auto healing on / off

•        new API in autoscaler to set maintenance mode – not sure whether that is necessary,
any pointers?

•        adding new / enhancing existing  topology events : [application / group /cluster
/ member]

o   enhancing messaging domain model to add maintenance state + sub states

o   adding / enhancing event handling in Autoscaler (receiver, monitors, etc …)

•  Event receiver / monitor for maintenance event

•  Can we utilize / reuse  ClusterMonitor->handleMemberMaintenanceModeEvent for this
feature ?


•        Adding maintenance state (In autoscaler e.g. ClusterStatusProcessor, GroupStatusProcessor,
etc. )

o   application

o   group

o   cluster

o   member – member already has a MAINTENANCE state, can we utilize it for this feature?

•        enhance / add Drools rules to handle the new maintenance mode to turn on / off
autoscaling, auto healing

o   scale up / scale down, dependent scaling, min / max

o   logging requirements

•        AutoscalerHealthStatEventReceiver

o   Handle Fault Events in context of maintenance mode

•        Persistence of maintenance related states

o   Registry - any pointers on how the maintenance mode should be persisted  ?

Any thoughts or feedback on this? Do you think other components will be affected or
need to be reworked?

The other question is what would be the best or recommended way to develop the feature
with input from the community and to ensure a smooth integration with the Stratos master.



From: Shaheedur Haque (shahhaqu)
Sent: 09 April 2015 14:44
Sandaruwan Nanayakkara (JIRA); Imesh Gunaratne
Subject: RE: Maintenance modes (was RE: [jira] [Commented] (STRATOS-1234) Software Update
Management Solution for Stratos)

Hi Imesh, Sandaruwan,

Here is a written-up proposal. I *think* it covers the various use cases suggested both here
and in JIRA STRATOS-1234, but as always, your thoughts on the matter are welcome. The write-up
has the form of a “spec” and a “Q&A”. As a next step, I guess we could do a hang-out
or con-call or something?

Thoughts welcome…

Thanks, Shaheed


The following commands, with the defined effects, are needed:

•        No command directly affects what I call the “major state” of the Application/Group/Cluster/Cartridge,
i.e. the state as reflected in the information CURRENTLY returned by the application/{appId}/runtime

•        Each command affects what I call the “operational state” only. The commands
and their operational states are:

o   Autoscaling on, off. Autoscaling on is current behaviour.

o   Autohealing on, off. Autohealing on is current behaviour.

o   Maintenance off, restart, replace. Maintenance off is current behaviour.

o   (We can add more later if needed)


Autoscaling off.

    Server effect: CEP gathers stats and history as usual. Autoscaler operates as usual, except
    that no scaling is done. Instead, a cluster state variable tracks the normal, overload or
    underload state and logs messages when this state variable changes value.

    Cartridge effect: No effect on running cartridges. No new cartridges are spun up, no existing
    cartridges are spun down EXCEPT for autohealing.

Autohealing off.

    Server effect: CEP ignores any heartbeat timeout other than to log that it happened, and set
    an instance state variable to track this.
    When autohealing is turned back on, the timeout will happen again, and the failure will be
    acted upon normally, except that the log shall make it clear (using the instance state
    variable) that the autohealing had been delayed.

    Cartridge effect: No new cartridges are spun up until after the autohealing is enabled.

Maintenance restart.

    Server effect: Like autohealing off except that an extra state variable is set indicating
    maintenance mode is in effect.
    Both state variables are cleared when the Cartridge resume event is seen.

    Cartridge effect: Cartridge is signalled with an *event*, not a blocking callout.
    Cartridge application must be able to reboot or just restart, and have the cartridge agent
    resume its previous (active/inactive) state. When resuming, the agent signals the server
    with a resume *event*.
    Note this implies the cartridge agent is restartable (because the application can choose to

Maintenance replace.

    Server effect: Like maintenance restart except that the cartridge instance is replaced.

    Cartridge effect: The difference between “restart” and “replace” is that the latter is for
    applications that cannot update themselves, but expect essentially a new VM instance with
    the new software.
    In other words, this is the big hammer/most general approach to upgrades (e.g. this is more
    likely to work than an apt-get downgrade ☺).
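
As an illustration of the "Autoscaling off" effect described above, here is a minimal Python sketch. It is not Stratos code: the names LoadState and ClusterLoadTracker, the thresholds, and the scale_requests list (standing in for real scaling calls) are all assumptions made for the example. It shows stats still being classified as normal, overload or underload, with transitions logged, while no scaling is requested when autoscaling is off.

```python
import logging
from enum import Enum

log = logging.getLogger("autoscaler.sketch")


class LoadState(Enum):
    NORMAL = "normal"
    OVERLOAD = "overload"
    UNDERLOAD = "underload"


class ClusterLoadTracker:
    """Hypothetical tracker; names and thresholds are illustrative only."""

    def __init__(self, low=0.2, high=0.8):
        self.low = low
        self.high = high
        self.autoscaling_on = True     # "autoscaling on" is current behaviour
        self.state = LoadState.NORMAL  # the cluster state variable
        self.scale_requests = []       # stands in for real scaling calls

    def classify(self, load):
        if load > self.high:
            return LoadState.OVERLOAD
        if load < self.low:
            return LoadState.UNDERLOAD
        return LoadState.NORMAL

    def on_stats(self, load):
        """Process one stats sample, whether autoscaling is on or off."""
        new_state = self.classify(load)
        if new_state != self.state:
            # Log only when the state variable changes value.
            log.info("cluster load state %s -> %s (autoscaling %s)",
                     self.state.value, new_state.value,
                     "on" if self.autoscaling_on else "off")
            self.state = new_state
        if self.autoscaling_on and new_state != LoadState.NORMAL:
            self.scale_requests.append(new_state)
```

With autoscaling off, the tracker keeps following the load state so that the logs show what the autoscaler would have reacted to.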

•        Each command referred to here is a REST API call.

•        Each command can apply to an entire Application, or any nested level (group or
cartridge) within it.

•        Arguments for application-wide use case:

o   application={appId}, operationalState={command}

•        Arguments for nested-level use case:

o   application={appId}, nesting={0}/{1}/{2}/…/{n}, operationalState={command}
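
The argument forms above could be assembled roughly as follows. This is only a sketch: the function name build_operational_state_args and the "kind-value" string encoding of the command (for example "maintenance-restart") are assumptions, since the proposal does not define how a command is spelled on the wire.

```python
# Commands and values taken from the proposal; the encoding is hypothetical.
ALLOWED_COMMANDS = {
    "autoscaling": {"on", "off"},
    "autohealing": {"on", "off"},
    "maintenance": {"off", "restart", "replace"},
}


def build_operational_state_args(app_id, kind, value, nesting=()):
    """Return the argument dict for one operational-state command."""
    if value not in ALLOWED_COMMANDS.get(kind, set()):
        raise ValueError(f"unsupported command: {kind} {value}")
    args = {"application": app_id, "operationalState": f"{kind}-{value}"}
    if nesting:
        # nesting={0}/{1}/.../{n}: path from the application down to the target
        args["nesting"] = "/".join(nesting)
    return args
```

An application-wide call omits the nesting argument; a nested-level call passes the path of group or cartridge names.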


1.      What’s the point of restart/replace, over and above auto* off?

These are to actually cause the application software in the VM instance to take note to do
something. Typically, I would expect this to result in an internally-managed software update.
For example, think of VMs running Ubuntu and pointing to a known repository of, say, security
patches: they could all just do an “apt-get update/upgrade”.

The Cartridge logic is defined to be event-based rather than blocking, because making the
thing blocking would be a problem if a reboot was involved. (Also, generally, blocking operations
in a distributed system raise too many edge cases like: can this operation be cancelled? Repeated?)

2.      Propagation/inheritance rules

I see two options:

•        Use hierarchy. If you apply a command at hierarchy level n, and n has internal structure
(i.e. it is a group not a cartridge), the command propagates all the way down (note: this
is implied in what I said for the application level command).

•        Do not use hierarchy. The command only applies to the level to which it was addressed
by the REST call.

In either case, the effect of contradictory commands is UNDEFINED, i.e. toggling the flags
in quick succession will likely result in an unhelpful outcome.
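
The two options can be contrasted with a small Python sketch. The Node class and apply_command function are hypothetical names used only for illustration; they are not part of the Stratos code base.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An application, group or cartridge in the nesting hierarchy."""
    name: str
    children: list = field(default_factory=list)
    operational_state: dict = field(default_factory=dict)


def apply_command(node, kind, value, use_hierarchy):
    """Set one operational state on node; optionally propagate downwards."""
    node.operational_state[kind] = value
    if use_hierarchy:
        # Option 1: the command propagates all the way down.
        for child in node.children:
            apply_command(child, kind, value, use_hierarchy=True)
    # Option 2 (use_hierarchy=False): only the addressed level is affected.
```

With use_hierarchy=False, a command addressed to a group leaves the cartridges below it untouched, which matches the "do not use hierarchy" option.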

I think the normal approach is NOT to use hierarchy; after all, just because there is an upgrade
to be applied for application code in a given set of VMs, there is nothing to say that any
elements lower down the hierarchy should be upgraded at the same time. Even in the case where
(say) security patches to a common OS are to be applied, I would doubt the sanity of anybody
doing this across every VM in the whole system in one go ☺. OTOH, maybe I am wrong!

3.      Should these commands apply to “deployed” or only to “configured” Applications?

I think the commands can be applied whether the Application is deployed or not… clearly
the stuff that sets flags on instances has to set those flags on all current and future instances
that may spin up under a given deployment.

From: Imesh Gunaratne
Sent: 27 March 2015 04:21
To: dev
Subject: Re: Maintenance modes (was RE: [jira] [Commented] (STRATOS-1234) Software Update
Management Solution for Stratos)

Hi Shaheed,

A really good suggestion! I think we could manage what you have suggested in the same implementation
as they overlap. I'm +1 for the idea of putting a cluster into the "Maintenance Mode" manually
for diagnostic purposes and stop autoscaling it. We could introduce new API methods to manage
this. The only question is whether we could use the same instance state for all the scenarios:

1. Update platform (we might need to use the term "platform" here, as "software" may get confused
with the software that may run on the platform)
2. Apply patches
3. Pause a cluster for diagnostic purposes

I would like to suggest changing the updateSoftware API method to updatePlatform:
POST /applications/{applicationId}/updatePlatform

Maybe we could introduce a new API method as follows to put a cluster into "Maintenance/Diagnostic" mode:
POST /clusters/{clusterId}/pause


On Thu, Mar 26, 2015 at 3:01 PM, Shaheedur Haque (shahhaqu) wrote:

First, let me say that I like a lot of what is proposed in this JIRA, but I am forking the
thread here because I would like to suggest that we generalise just one part of it, the API
into Stratos to cover a set of related use cases.

In the current version of this JIRA, the proposed API into Stratos looks like this:

PUT /api/applications/{applicationId}/updateSoftware

(see the JIRA section 2.3 for the details). I think this is actually one of a set of possible
runtime states that we would like to put VM instances and various parts of Stratos in. Notice
that I am deliberately not using specific terms such as "cluster" or "Autoscaler" because
working that out is the point of this email.

So, the sorts of use cases I have in mind are:

  *   Updating the cartridge software as per this JIRA
  *   Putting a cluster (or maybe an instance) into a "maintenance mode" for diagnostic reasons.
There could be multiple versions of this maintenance mode where (for example)

     *   The instance(s) might still handle traffic and deliver "I'm alive" health stats but
no autoscaling is done.
     *   The instance(s) don't deliver health stats but no health stats

  *   Some of these would deliver notifications to the cartridge agent, others might only
affect Stratos component(s).
  *   etc...other ideas anybody?

Thus, it might make sense to generalise the API to support  a set of closely related cases.
Is there interest in taking such an approach to address this JIRA as well in clarifying and
addressing the other use cases?

Thanks, Shaheed

From: Sandaruwan Nanayakkara (JIRA)
Sent: 25 March 2015 08:36
Subject: [jira] [Commented] (STRATOS-1234) Software Update Management Solution for Stratos


Sandaruwan Nanayakkara commented on STRATOS-1234:

Hi all,

I have updated the Google doc with update scenarios. Please share your ideas by commenting;
they will be much appreciated.

After several days I finally deployed almost all of the Stratos samples with Kubernetes and OpenStack.
Now the main difficulty is triggering updates in different software. Can you give an example
of a piece of software and how its update is triggered manually? A practical approach?
Suppose that I have software in a single cartridge application. When triggering an update
via REST, we need a specific way to communicate with the software. Is there any way that
this update command can be passed to the software?


> Software Update Management Solution for Stratos
> ------------------------------------------------
> Key: STRATOS-1234
> URL:
> Project: Stratos
> Issue Type: New Feature
> Reporter: Imesh Gunaratne
> Labels: gsoc2015, mentor
> Stratos uses Virtual Machines and Containers for hosting platform services on different
Infrastructure as a Service (IaaS) solutions. At present Puppet is used for orchestration
management on Virtual Machine based systems and manages all required software in Puppet Master.
Container based systems create Docker images for each platform service by including required
software in the Docker image itself.
> In Virtual Machine use-case VM instances will communicate with Puppet master and execute
the software installation. The same approach can be used for applying software updates.
> In Docker use-case we do not use Puppet because a new container with required software
can be started in a few seconds. This is very efficient compared to using Puppet and installing
software on demand.
> The requirement of this project is to implement a core Stratos feature to propagate software
updates in a live PaaS environment.
> 1. Puppet based solution:
> - Push software updates of a cartridge to Puppet Master (might not need to automate).
> - Invoke the software update process via the Stratos API for a given application.
> - Stratos Manager could send a new event to trigger puppet agent in each instance to
apply the updates.
> 2. Docker based solution
> - Create a new docker image (with a new image id) for the cartridge with software updates
(might not need to automate).
> - Invoke the software update process via the Stratos API for a given application.
> - Autoscaler can implement a new feature to bring down existing instances and create
new instances with the new docker image id.
> Important!
> - In each scenario if updates are backward compatible, software update process should
execute in phases, it should not bring down the entire cluster to apply the updates. If so
the service will be unavailable for a certain time period. The idea is to apply the updates
to set of members at a time.
> - If the updates are not backward compatible, we could make the entire cluster unavailable
at once and apply the updates.
> - Member's state needs to be changed to a new state called "Updating" when applying the
> If there is an interest in doing this project please send a mail to imesh at apache dot
org by copying Apache Dev mailing list [1]. Please refer to the Stratos Wiki [2] for more information
on Stratos architecture and how it works.
> [1]
> [2]
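
To illustrate the phased update described in the quoted issue text above, here is a minimal Python sketch. The function name update_batches and its parameters are hypothetical; the issue does not specify how the set of members updated at a time is chosen.

```python
def update_batches(members, batch_size, backward_compatible=True):
    """Yield the lists of members that would be updated together."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not backward_compatible:
        # Incompatible update: the entire cluster is taken down at once.
        yield list(members)
        return
    # Compatible update: apply the update to a set of members at a time,
    # so the rest of the cluster keeps serving traffic.
    for start in range(0, len(members), batch_size):
        yield list(members[start:start + batch_size])
```

A caller would update one batch, wait for those members to become active again, and only then move on to the next batch.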

This message was sent by Atlassian JIRA

Imesh Gunaratne

Technical Lead, WSO2
Committer & PMC Member, Apache Stratos

Sent from Gmail Mobile