cloudstack-dev mailing list archives

From John Burwell
Subject Re: [DISCUSS] KVM HA with IPMI Fencing
Date Thu, 10 Dec 2015 21:41:32 GMT
Ronald and Ilya,

Reviewing this FS, I feel that it is a very good idea.  However, I think it needs the following
refinements to ensure the HA fencing/recovery operations do not overwhelm the management server
and that the capabilities could be used to provide HA for non-KVM resources:

  * Separate Host Power Management: Host power management could be useful for more than just
repairing stalled KVM hosts.  For example, starting a new server before provisioning it using
the bare metal plugin or implementing a power management facility to shut down hosts when
demand is low.
  * Define a HA Resource Management Service: Extract the HA check and recovery finite state
machines (FSM), persistence, and distributed semantics.  The check and recovery of specific
resource/device types would be provided via plugins.  In the future, we may also want to consider
exposing the service to end users -- allowing applications to leverage it for their HA needs.
  * Build KVM Host HA into the HA Resource Management Service: Implement KVM Host HA in terms
of the HA resource management service and the IPMI feature.  This approach will increase the cohesion
of the KVM plugin by colocating HA check and recovery implementations.  It will also allow
us to separate the requirements for proper KVM host HA operation from the more general requirements
for HA operation on the CloudStack control plane.

Finally, we need to ensure that the HA workload can be fairly scaled and recovered across
a management server cluster.  I don't think we want to create an HA/fencing mechanism that
is unreliable when one or more management servers in a cluster fail.

Host Power Management Service

While I would like to see IPMI extracted to another FR, I am providing my feedback here because
power management impacts KVM Host HA.  Given the wide range of systems management interfaces
(e.g. IPMI, ILO, DRAC, converged solutions, etc), it seems sensible to provide a pluggable power management
service.  This service would manage a per-host power management FSM to capture the current power
status using the following states:

  * ON: The host is powered on
  * OFF: The host is powered off
  * UNKNOWN: The host's power status cannot be determined

The FSM would default to UNKNOWN and transition based on the power state retrieved.  If the power
status cannot be retrieved (e.g. no provider available, not configured, unable to connect
to the system management interface), the state would remain UNKNOWN.  It also needs to account
for the following failure scenarios:

  * The power status of the host is changed outside of CloudStack (e.g. a datacenter technician
manually powers down a host)
  * Connectivity to the host's system management interface is lost
  * Responses from the system management interface are too slow

The first two scenarios can be handled with a power state sync daemon that regularly queries
the power state of the host.  In the event that the power status changed externally or the
management server can no longer query the host's system management interface, the sync daemon
would transition the host's power management FSM to the appropriate state.  To protect against
slow responses from the system management interface, all power management operations should
be bounded by a timeout.  It seems to me that the timeout should be configured globally with
a per-host override.  Finally, the power management FSM state transitions should trigger the
following events:

  * ON->OFF: The power state sync daemon detects that the host was turned off
  * OFF->ON: The power state sync daemon detects that the host was turned on
  * ON->UNKNOWN: The power state sync daemon loses connectivity to the system management
interface
  * UNKNOWN->OFF, UNKNOWN->ON: The power state sync daemon gains connectivity
to the system management interface
  * User triggered restart
  * User triggered power on
  * User triggered power off

All changes to a host's power state would be broadcast on the event bus.
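
To make the power management FSM and the sync daemon a little more concrete, here is a
minimal Python sketch of one sync pass.  It is only an illustration under stated assumptions:
the names PowerState, HostPowerFsm, sync, and the query/publish callables are hypothetical
and are not CloudStack APIs, and a real implementation would live in the management server.
The query is bounded by a timeout, and any failure to retrieve the state leaves the host
in UNKNOWN.

```python
import concurrent.futures
from enum import Enum


class PowerState(Enum):
    ON = "ON"
    OFF = "OFF"
    UNKNOWN = "UNKNOWN"


class HostPowerFsm:
    """Tracks one host's power state and reports every transition (hypothetical names)."""

    def __init__(self, publish):
        self.state = PowerState.UNKNOWN  # the FSM defaults to UNKNOWN
        self._publish = publish          # stand-in for the event bus

    def sync(self, query, timeout_s):
        """One sync daemon pass: query the power state, bounded by a timeout."""
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        try:
            observed = pool.submit(query).result(timeout=timeout_s)
        except Exception:
            # Timeout, no provider, not configured, or unable to connect.
            observed = PowerState.UNKNOWN
        finally:
            pool.shutdown(wait=False)  # do not block on a slow query
        if observed != self.state:
            previous, self.state = self.state, observed
            self._publish(f"{previous.value}->{observed.value}")
        return self.state
```

A caller would run sync on a schedule per host, passing a provider-specific query function
and the global or per-host timeout.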

In terms of implementing IPMI, it is important to note that the FS assumes that all management
servers will have access to the system management network.  I also think we should default
to a native Java implementation such as ipmi4j [1] with a configurable fallback to shell scripts.
 I have two issues with shell scripts -- hidden/unmanaged dependencies and process overhead.
 These types of shell scripts introduce platform dependencies that are difficult to manage
and vary across distributions, making them difficult to properly test across the various Linux
distributions.  They also incur additional JVM overhead to execute and have varied error reporting
semantics.  By defaulting to a Java library with a user-defined shell script fallback, the default
management server behavior is more deterministic and testable while providing users with the
ability to customize it, along with the burden of managing additional system dependencies/platform specific
issues.  Finally, it seems reasonable to allow users to define these preferences globally,
as well as on a per-host basis.  In the future, we may want to consider a host profile feature
that would allow different common attributes for similar hosts to be defined to simplify host

HA Resource Management Service

The FS covers basic operation of KVM host HA well.  It lays out a model that can be divided
into an HA resource management service that coordinates check and recovery operations over
a set of resources contained in a partition (i.e. zone, pod, cluster) and providers that
implement check and recovery operations for specific resource types.  The CloudStack control
plane requires that this HA resource management service achieve the following goals:

  * Operational Simplicity: The service should be as simple to install, configure, and manage
as any other management server service.  CloudStack's operational simplicity is a significant
advantage that should not be compromised for more complex use cases.  I see it as our job
to design the system to make operation of these complex use cases as simple and straightforward
as possible.
  * Leverage Abstractions: The service should leverage the abstractions exposed by the control
plane.  A significant amount of effort has been invested to integrate devices into CloudStack.
 By composing our existing logical abstractions (e.g. volumes, power management, etc), the
service transparently gains the benefits of this work.
  * Integrated Resource Management: The service should feed back into other resource management
activities (e.g. allocation, scheduling, etc) to understand the health of resources and support
advanced, contextual recovery modes.

I believe that the only way to accomplish these goals is to implement an HA/fencing service
natively in the management server.  Relying on an external system to heartbeat, fence, and
recover resources would increase both the configuration and operational complexity of the
system.  Embedding the HA system into the control plane allows it to be designed to fail consistently
with the rest of the management server.  From an operational perspective, a batteries included
approach not only simplifies installation, but also monitoring and error debugging.  An external
system would also require us to duplicate the existing integration work, increasing test effort and
decreasing the cohesion of the system.  Finally, implementing a resource state synchronization mechanism
that properly guards against and deterministically resolves conflicts due to partitions with clustered
management servers would likely be as complex as building the service natively.  The one advantage
that existing solutions such as Linux HA provide is battle tested hardness.  Given the need
for correctness in this type of service, this attribute should not be discounted.  However,
STONITH is a well-documented, relatively straightforward strategy that I believe we can properly
implement and test.  I also believe that the complexity of properly integrating an external
solution to CloudStack's core control plane would introduce a larger instability/reliability

HA providers define the type of partition on which they operate (e.g. KVM hosts in a cluster)
and the type of resource.  They implement the check and recovery operations required to determine
state transitions for the FSMs managed by the service.  Administrators can enable or disable
HA management for partitions that support HA.  For HA enabled partitions, the HA resource management
service maintains a finite state machine (FSM) with the following states:

  * NONE: HA is disabled for the partition
  * ACTIVE: HA is enabled for the partition and conditions within it permit HA operation
  * INACTIVE: HA is enabled for the partition and conditions within it do not support HA operation

The HA service would only operate on resources within scopes/partitions with an ACTIVE state.

We also need to consider the risk of HA/fencing contributing to system instability and failure.
 In order to mitigate this risk, there are a number of failure modes that should also be considered.
 To reason about these failure modes, I propose the HA resource management service maintain
a per-resource FSM with the following states:

  * NONE: The resource is part of a partition where HA operations have been disabled or the
containing partition HA state is INACTIVE.
  * INITIALIZING: The health and eligibility of the resource for HA management is currently
being determined.  If the resource is in a HA scope/partition, its associated HA provider
indicates it is eligible for HA, and it passes a health check, the HA state will be transitioned
to AVAILABLE.  If it is not part of HA scope/partition and/or the associated HA provider indicates
it is not eligible for HA, then the HA state is transitioned to UNSUPPORTED.  If the resource
is in an HA scope/partition, its associated HA provider indicates it is eligible for HA, and
it fails a health check, the HA state will be transitioned to SUSPECT.
  * AVAILABLE: The resource is available based on the passage of the most recent health check
and its containing partition has an HA state of ACTIVE.  When transitioning to this state,
the number of retry attempts is reset.
  * INELIGIBLE: The resource's enclosing partition has an HA state of ACTIVE but its current
state does not support HA check and/or recovery operations.  Any resource in maintenance mode
is automatically transitioned to INELIGIBLE.
  * SUSPECT: The resource is currently suspected of a failure due to failing its most recent
health check.  It is pending an activity check.  When a node fails multiple activity checks,
the duration between checks will back off to a maximum interval specified as a global setting (e.g.
first check after 10 seconds, second check after 20 seconds, third check after 40 seconds,
up to a maximum interval of 250 seconds).
  * CHECKING: An activity check is currently being performed on the resource.  If the activity
check passes, the HA state of the resource is transitioned back to SUSPECT.  If the activity
check fails, the HA state of the resource is transitioned to RECOVERING.
  * RECOVERING: The resource failed its activity check and automated recovery operations are
in-progress. If the recovery operation succeeds, the HA state of the resource is transitioned
to INITIALIZING.  If the recovery operation fails and the number of recovery attempts is less
than the maximum attempts, the recovery operation is retried.  If the recovery operation fails
and the number of retry operations is equal to or exceeds the maximum retry attempts, the
HA state of the resource is transitioned to FAILED.
  * FAILED: The resource is not operating normally and automated attempts to recover it failed.
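
The per-resource state model and the check back-off can be sketched roughly as follows.
This is a minimal Python illustration, not a proposed implementation: the event names,
the function names, and the default of three recovery attempts are assumptions made for
the example, and the AVAILABLE to SUSPECT transition on a failed health check is inferred
from the SUSPECT description rather than stated explicitly.  The back-off doubles from
10 seconds up to the 250 second cap used in the example above.

```python
def activity_check_delay(attempt, base_s=10, max_s=250):
    """Seconds to wait before activity check number `attempt` (1-based)."""
    return min(base_s * 2 ** (attempt - 1), max_s)


# Transitions read off the state descriptions; event names are illustrative only.
TRANSITIONS = {
    ("INITIALIZING", "health_passed"): "AVAILABLE",
    ("INITIALIZING", "health_failed"): "SUSPECT",
    ("INITIALIZING", "not_eligible"): "UNSUPPORTED",
    ("AVAILABLE", "health_failed"): "SUSPECT",      # inferred, see lead-in
    ("SUSPECT", "activity_check_started"): "CHECKING",
    ("CHECKING", "activity_passed"): "SUSPECT",
    ("CHECKING", "activity_failed"): "RECOVERING",
    ("RECOVERING", "recovery_succeeded"): "INITIALIZING",
}


def next_state(state, event, recovery_attempts=0, max_recovery_attempts=3):
    """Return the next HA state; raises KeyError for an undefined transition."""
    if state == "RECOVERING" and event == "recovery_failed":
        # Retry while attempts remain, otherwise give up on automated recovery.
        if recovery_attempts < max_recovery_attempts:
            return "RECOVERING"
        return "FAILED"
    return TRANSITIONS[(state, event)]
```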

A health check is an active operation periodically executed to verify connectivity to and proper
operation of the API endpoints used to control a resource.  To allow the control plane to detect manual
correction of a FAILED resource, health checks should run regardless of a resource's HA state.
 An activity check is a passive operation that observes side effects of a resource's normal
operation to determine whether or not it is functioning properly but unable to communicate
with the control plane.  Typically, health checks are cheap to perform regularly, whereas activity
checks require more work and/or coordination to perform.  As such, activity checks are only
performed when a health check fails.  Finally, health and activity checks, as well as recovery
operations, are atomic and idempotent actions bounded by a completion timeout.

When the HA state of an HA enabled partition transitions from INACTIVE to ACTIVE, the HA state
of all contained resources is transitioned from NONE to INITIALIZING.  When the HA state of
an HA enabled partition transitions from ACTIVE to INACTIVE, the HA state of all contained
resources is transitioned to NONE.  Based on the HA resource and scope/partition state models,
the following failure scenarios and their responses should be included in the design of the

  * Activity check operation fails on the resource:  Provide a semantic in the activity check
protocol to express that an error occurred while performing the activity check, along with a reason
for the failure (e.g. unable to access the NFS mount).  If the maximum number of activity check attempts
has not been exceeded, the activity check will be retried.
  * Slow check operation:  After a configurable timeout, the management server abandons the
check.  The response to this condition would be the same as a failure to recover the resource.
  * Traffic flood due to a large number of resource recoveries: The HA resource management
service limits the number of concurrent recovery operations permitted to avoid flooding the
management server with resource status updates as recovery operations complete.
  * Processor/memory starvation due to large number of activity check operations: The HA resource
management service limits the number of concurrent activity check operations permitted per
management server to prevent checks from starving other management server activities of scarce
processor and/or memory resources.
  * A SUSPECT, CHECKING, or RECOVERING resource passes a health check before the state action
completes: The HA resource management service refreshes the HA state of the resource before
transition.  If it does not match the expected current state, the result of state action is

The following observations are based on these failure scenarios and the HA state model:

  * Only resources with an HA state of AVAILABLE should be eligible for work allocation to
compute offerings requiring HA
  * Resources with an HA state of NONE, INELIGIBLE, INITIALIZING, and AVAILABLE should be
eligible for work allocation to compute offerings that do not require HA
  * Allocation should prefer non-HA partitions for resources using a compute offering that
does not require HA
  * A global setting should define the maximum number of concurrent recovery operations on a per
management server basis
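
One simple way to enforce such a per management server cap is a counting semaphore.  The
sketch below is a minimal Python illustration; RecoveryLimiter and try_run are hypothetical
names, and the choice to reject (rather than queue) work when the cap is reached is an
assumption made to keep the example short.

```python
import threading


class RecoveryLimiter:
    """Caps concurrent recovery operations on one management server (hypothetical)."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_run(self, recover):
        """Run `recover` if a slot is free.  Returns False when the cap is reached,
        so the caller can requeue the resource instead of blocking."""
        if not self._slots.acquire(blocking=False):
            return False
        try:
            recover()
            return True
        finally:
            self._slots.release()
```

The same pattern would apply to the limit on concurrent activity checks.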

The HA Resource Management Service should trigger system events for the following
resource HA state transitions:

  * SUSPECT->CHECKING:  An activity check is being performed on a resource
  * CHECKING->RECOVERING: A resource failed its activity check and recovery is being attempted
  * CHECKING->FENCED: A resource passed its activity check and it is not controllable by
the control plane
  * RECOVERING->FAILED: Recovery of a resource failed
  * RECOVERING->INITIALIZING: Recovery of a resource succeeded
  * FENCED->INITIALIZING: A FENCED resource passes a health check
  * FAILED->INITIALIZING: A FAILED resource passes a health check

The service should also trigger a system event when a scope/partition transitions between
the ACTIVE and INACTIVE states.

Finally, VM allocation already supports HA/non-HA compute offerings.  We will likely need
to refine allocation to ensure that VMs using an HA compute offering are only allocated to hosts
with an AVAILABLE HA state.  Furthermore, we need to ensure that allocation prefers hosts
with an HA state of NONE for VMs with a non-HA compute offering.  For the initial implementation,
it seems acceptable to leave VMs requiring an HA compute offering on hosts whose HA state transitions
from AVAILABLE to INELIGIBLE or NONE.  A future enhancement could determine how to properly handle
this scenario.


Under the model described in the previous section, the activity check, host eligibility determination,
and recovery actions would be implemented as a HA Resource Management Service provider in
the existing KVM plugin.  This HA provider would compose the host power management service
to implement the reboot strategy described in the FS.  The primary advantages of this approach
are high cohesion of the code supporting KVM integration and decoupling of the plugin from a specific
system management interface type.  Per the FS, the KVM HA provider would be defined with a
cluster scope/partition and treat hosts meeting the following criteria as eligible for HA:

  * The host's hypervisor is KVM
  * One or more VMs on the host use NFS shared storage
  * The host's power management state is ON

KVM clusters that meet the following criteria will be transitioned to an ACTIVE state:

  * The cluster is enabled
  * The pod containing the cluster is enabled
  * The zone containing the cluster's pod is enabled
  * The cluster contains at least two (2) hosts with an HA state of AVAILABLE

The KVM Host HA provider would reassess the cluster's HA state whenever the HA state of one
of its contained resources transitions.
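
The cluster criteria above amount to a small predicate that the provider would re-evaluate
on each host state transition.  A minimal Python sketch, assuming a hypothetical Cluster
record and cluster_ha_state function (neither is a CloudStack type):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    """Hypothetical snapshot of the facts the criteria depend on."""
    ha_enabled: bool
    enabled: bool
    pod_enabled: bool
    zone_enabled: bool
    host_ha_states: List[str]


def cluster_ha_state(cluster, min_available=2):
    """Return NONE, ACTIVE, or INACTIVE for a KVM cluster."""
    if not cluster.ha_enabled:
        return "NONE"  # HA is disabled for the partition
    available = sum(1 for s in cluster.host_ha_states if s == "AVAILABLE")
    meets_criteria = (
        cluster.enabled
        and cluster.pod_enabled
        and cluster.zone_enabled
        and available >= min_available
    )
    return "ACTIVE" if meets_criteria else "INACTIVE"
```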

The provider would monitor events from the power management service for all KVM hosts in an
HA cluster.  To address scenarios where a change in a host's power state affects its HA eligibility,
the plugin would respond to the following host power state transitions:

  * OFF->ON: The host's HA state is transitioned to INITIALIZING.  To prevent futile check
and recovery attempts in the period between power up and OS boot/KVM agent start, a configurable
quiet period will be specified globally and overridable in the host configuration.  When a
failed health check occurs during this quiet period, the state will remain INITIALIZING.
If a successful health check occurs during the quiet period, the HA state will be transitioned
to AVAILABLE and the quiet period discontinued.  Once the quiet period expires, a failed health
check will trigger a transition of the HA state to SUSPECT.
  * ON->OFF, UNKNOWN->OFF, OFF->UNKNOWN: The host's HA state is transitioned to UNSUPPORTED
because the host is not running
  * ON->UNKNOWN: All states except FENCED and FAILED are transitioned to UNSUPPORTED because
the control plane cannot communicate with the system management interface.  The FENCED and
FAILED states are maintained because an inability to access the system management interface
does not affect these states.
  * UNKNOWN->ON: All states except FENCED and FAILED are transitioned to INITIALIZING to
reassess the HA eligibility and recalculate the host's HA state.  The FENCED and FAILED states
are maintained because a repair to a partition in the system management network does not indicate
that the host's operation has been repaired.

The FS specifies that when the management server loses its connection with the KVM agent,
the underlying shared storage must be checked for activity to determine whether or not the host
can be safely rebooted.  Therefore, the KVM HA health check would be KVM agent connectivity,
and the activity check would be checking the underlying VM volumes for write activity since
the last successful health check.  Conceptually, I really like the idea of checking for disk
activity via the control plane using the volume.  The most significant advantage I see is reusing our
abstractions to query information from the infrastructure and placing the activity check implementation
in the storage driver where it belongs.  However, implementing the capability may introduce
more problems than it solves.  As an example, NFS would likely require the introduction of
system VMs to mount NFS volumes and collect file information -- introducing another set of
failure scenarios that would complicate the design and operation of the system.  Therefore,
I think we should investigate an implementation of this activity check through CloudStack's
Volume abstraction to determine if we can implement it without incurring unacceptable complexity
and compromising system reliability.

If the activity check cannot be reliably performed via the control plane, the FS specifies
that activity checks will be performed by an adjacent node in the cluster.  This approach assumes
that all hosts in the cluster have access to the same underlying NFS mount.  These checks
should only be performed by hosts with an HA state of AVAILABLE.  In the event a host performing
an activity check encounters an error during the check operation (e.g. unable to read the
NFS mount), the HA state of the checking host would be transitioned to SUSPECT.  Finally,
as specified, the activity check relies on the clocks between the management server, host,
and NFS server being in sync.  Using a relative check that watches a file for a timestamp
change within a specified number of seconds would eliminate this time drift issue.  To avoid
live-locking of activity check threads in the HA resource management service, we may want
to consider implementing activity checks as a request-reply model with a reply timeout.
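
A relative check of this kind might look like the following minimal Python sketch.  The
function name file_is_active and its parameters are hypothetical.  It compares two
modification times of the same file with each other and measures the window with the
local monotonic clock, so the management server, host, and NFS server clocks never need
to agree.  One caveat to keep in mind: NFS clients cache file attributes, so a real check
would need to account for that delay when choosing the window.

```python
import os
import time


def file_is_active(path, window_s, poll_s=1.0):
    """Return True if `path`'s modification time changes within `window_s` seconds."""
    baseline = os.stat(path).st_mtime_ns
    deadline = time.monotonic() + window_s  # local clock only bounds the wait
    while time.monotonic() < deadline:
        time.sleep(poll_s)
        if os.stat(path).st_mtime_ns != baseline:
            return True   # something wrote to the file while we watched
    return False          # no write observed inside the window
```

Because the function always returns within roughly window_s plus one poll interval, it
also fits the bounded, timeout-driven check model described earlier.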

Per the FS, the KVM Host HA provider recovers a host by power cycling hosts that fail activity
checks.  Before performing the power cycle, the provider would refresh the power state and
verify it is ON before restarting the host.  The power state refresh protects against a race
condition between the actual power state of the host changing and the control plane recognizing
the state change.  The handling of the ON->OFF and OFF->ON power state transitions will
take care of waiting for the power cycle to complete and resetting the HA state FSM.  Finally, if
the host power cycle operation fails, the HA state of the host would be transitioned to FAILED.

Clustered Management Servers

Currently, management server clustering divides the host ownership across management servers.
 This ownership is a static calculation that does not rebalance when a management server is
added or removed from a cluster.  Most critically, there is no handoff of host ownership when
a management server fails.  For the HA resource management service to operate reliably, we
must address these gaps in the host ownership model.  Additionally, the HA resource management
service would require clustering to support ownership of partitions (e.g. zones, pods, and
clusters) to properly manage the HA state of partitions and perform partition-scoped operations.
 Given the scope of this topic, I believe it should be addressed in greater depth in a separate


With some refinement, I believe we can add a powerful and resilient HA/fencing capability to
the CloudStack control plane.  It would not only support KVM Host HA, but any other resource
type managed by the control plane.  I think it would be best to decompose the effort into
four parts: Host Power Management, HA Resource Management Service, KVM Host HA Provider,
and management server clustering improvements.



John Burwell


a:      53 Chandos Place, Covent Garden London WC2N 4HS UK



On Oct 20, 2015, at 7:21 AM, Ronald van Zantvoort wrote:

On 19/10/15 23:10, ilya wrote:

Please see response in-line...
And you too :)

On 10/19/15 2:18 AM, Ronald van Zantvoort wrote:
On 16/10/15 00:21, ilya wrote:
I noticed several attempts to address the issue with KVM HA in Jira and
Dev ML. As we all know, there are many ways to solve the same problem,
on our side, we've given it some thought as well - and its on our to do

Specifically a mail thread "KVM HA is broken, let's fix it"

We propose the following solution that in our understanding should cover
all use cases and provide a fencing mechanism.

NOTE: Proposed IPMI fencing, is just a script. If you are using HP
hardware with ILO, it could be an ILO executable with specific
parameters. In theory - this can be *any* action script not just IPMI.

Please take few minutes to read this through, to avoid duplicate

Proposed FS below:

Hi Ilja, thanks for the design; I've put a comment in 8943, here it is
verbatim as my 5c in the discussion:

ilya musayev: Thanks for the design document. I can't comment in
Confluence, so here goes:
When to fence; Simon Weller: Of course you're right that it should be
highly unlikely that your storage completely disappears from the
cluster. Be that as it may, as you yourself note, first of all if you're
using NFS without HA that likelihood increases manyfold. Secondly,
defining it as an unlikely disastrous event seems no reason not to take
it into account; making it a catastrophic event by 'fencing' all
affected hypervisors will not serve anyone as it would be unexpected and

The entire concept of fencing exists to absolutely ensure state.
Specifically in this regard the state of the block devices and their
data. Marcus Sorensen: For that same reason it's not reasonable to 'just
assume' VM's gone. There's a ton of failure domains that could cause an
agent to disconnect from the manager but still have the same VM's
running, and there's nothing stopping CloudStack from starting the same
VM twice on the same block devices, with disastrous results. That's why
you need to know the VM's are very definitely not running anymore, which
is exactly what fencing is supposed to do.

For this, IPMI fencing is a nice and very often used option; absolutely
ensuring a hypervisor has died, and ergo the running VM's. It will
however not fix the case of the mass rebooting hypervisors (but rather
quite likely making it even more of an adventure if not addressed properly)
Now, with all that in mind, I'd like to make the following comments
regarding ilya musayev 's design.

First off, the IPMI implementation: There is IMHO no need to define IPMI
(Executable,Start,Stop,Reboot,Blink,Test). IPMI is a protocol, all these
are standard commands. For example, using the venerable `ipmitool` gives
you `chassis power (on,status,poweroff,identify,reset)` etc. which will
work on any IPMI device; only authentication details (User, Pass, Proto)
differ. There's bound to be some library that does it without having to
resort to (possibly numerous) different (versions of) external binaries.
I am well aware of this - however, i want this to be flexible. There may
be a hardware that does not conform to OpenIPMI standard (yet). Perhaps
i want to use a wrapper script instead that may do few more actions
besides IPMI. While initially we intended for this to be IPMI, i dont
want to limit it to just IPMI. Hence, some flexibility would not hurt.
It can be IPMI - or anything else, but i want this to be flexible.

Something like STONITH comes to mind; This venerable Linux-HA project has been the fencing
cornerstone of Pacemaker/Corosync clusters for years. It contains a ton of standardized fencing
mechanisms and although I haven't used it in quite a while, it should be generically employable.

You'd simply provide a configuration interface for maybe even multiple STONITH fencing mechanisms
without having to reinvent all these wheels yourself.

Secondly you're assuming that hypervisors can access the IPMI's of their
cluster/pod peers; although I'm not against this assumption per se, I'm
also not convinced we're servicing everybody by forcing that assumption
to be true; some kind of IPMI agent/proxy comes to mind, or even
relegating the task back to the manager or some SystemVM. Also bear in
mind that you need access to those IPMI's to ensure cluster
functionality, so a failure domain should be in maintenance state if any
of the fence devices can't be reached
Good point, perhaps a test needs to be built in to confirm that ACS can
reach its target with proper credentials.

Thirdly your proposed testing algorithm needs more discussion; after
all, it directly hits the fundamental principal reasons for why to fence
a host, and that's a lot more than just 'these disks still gets writes'.
Agree, hence this thread :)

In fact, by the time you're checking this, you're probably already
assuming something's very wrong with the hypervisor, so why not just
fence it then?
Because it's not that simple with the way CloudStack works now. There
are many corner cases we need to consider before we pull the plug. There
could be numerous issues - some temporary environmental problems. Few
things come to mind now:
1) Agent can die
2) Communication issue between MS and KVM host

The decision to fence should lie with the first
notification that something is (very) wrong with the hypervisor, and only
limited attempts should be made to get it out. Say it can't reach its
storage and that gets you your HA actions; why check for the disks
Because we realize that existing HA framework is not very robust and
also "incomplete". We can write many test cases and follow them one by
one, but if fundamentally - your VM is up and running (writing to disk)
and no other hypervisor owns this VM - your hypervisor in question is
functional. We would like to stay on the cautious side, even if single
VM is up on hypervisor, it wont be killed.

Try to get the storage back up like 3 times, or for 90 sec or so,
then fence the fucker and HA the VM's immediately after confirmation. In
fact, that's exactly what it's doing now, with the side note that
confirmation can only reasonably follow after the hypervisor is done
Both check interval and sleep in between should be configurable. If you
don't want to wait 30 and do 3 tests, do just 1 test and set it to 1
second, it's up to the end user what he wants it to be. We are providing
the framework to achieve this, end user gets to plug in what he feels is

My remarks were more workflow-related: I'd propose something like
<somethings' wrong> --> <put hv in alert state> --> <best effort to fix>
--> <fence> --> <confirm> --> <put hv in down state, & HA VM's>

Your proposal to me read something like
<somethings' wrong> --> <check or fence> --> <best effort to fix>
--> <check or fence> --> <try to fix again> --> <fence> -->
<put hv in down state, & HA VM's>

Finally as mentioned you're not solving the 'o look, my storage is gone,
let's fence' * (N) problem; in the case of a failing NFS:
Every host will start IPMI resetting every other hypervisor; by then
there's a good chance every hypervisor in all connected clusters are
rebooting, leaving a state where there's no hypervisors in the cluster
to fence others; that in turn should lead to the cluster falling in
maintenance state, which will lead to even more bells & whistles going off.
They'll come back, find the NFS still gone, and continue resetting
each other like there's no tomorrow

This is a doomsday scenario that should not happen (based on proposed
1) This feature should apply only to "Shared" storage setup (not mixed -
local and shared)

Whenever 'VM's with HA flag' + 'VM's can HA' is involved this mechanism should be in place?
Even local storage VM's can find themselves in dire straits with a need to fence the entire
hypervisor to get them running again.

2) If you lose the storage underneath, as you propose (the entire array),
all hosts in the cluster will be down. Therefore, no
hypervisors will be qualified to do a neighbor test. The feature we
propose would not go on a shooting spree, because there will be no
hypervisor that is in "Up" state.
In addition, if something like that was to occur, another safety check
should be there. Quoting confluence page:
"Failure actions must be capped to avoid freak issue shutdowns, for
example allow no more than 3 power downs within 1 hour -> both options
configurable, if this condition is met – stop and notify"
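
The cap quoted above could be sketched as a sliding-window counter. This is only an illustration under assumptions: the class name `FenceCap` and its parameters are hypothetical, the defaults mirror the "3 power downs within 1 hour" example, and the caller passes in the current time in seconds.

```python
from collections import deque


class FenceCap:
    """Allow at most `max_actions` power actions per sliding time window."""

    def __init__(self, max_actions=3, window_seconds=3600):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._stamps = deque()   # timestamps of actions still inside the window

    def allow(self, now):
        """Return True and record the action if the cap is not yet reached."""
        # Forget actions that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_actions:
            return False         # cap met: stop and notify instead of fencing
        self._stamps.append(now)
        return True
```

With the defaults, a fourth power down inside the same hour is refused; where the counter lives (per cluster, per zone, per management server) is a separate design question this sketch does not answer.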

Yeah kinda what I said in the 'clusters fence themselves' paradigm. That's one of the core
things that absolutely needs to be got right IMHO.
I might be biased because I had situations like this happening to me in a number of different
sizable clusters, which was really bad for my life expectancy ;)

Support staff already panicking over the NFS/network outage now has to
deal with entire clusters of hypervisors in perpetual reboot as well as
clusters which are completely unreachable because there's no one left to
check state; this all while the outage might simply require the revert
of some inadvertent network ACL snafu

In the current implementation, CloudStack agents will commit suicide; we
would like this to change as well and be made user configurable.

I'd be rather careful with that. We need to find the right balance between configurables (timeouts,
retries, fencing mechanisms & authorities) and irresponsible users (no fencing 'cauz'
that's bothersome, but try HA nonetheless so my VM's are back up quickly, then start complaining
loudly about CloudStack when a small number of 'em have borked up their data)

Although I well understand Simon Weller's concerns regarding agent
complexity in this regard, quorum is the standard way of solving that
problem. On the other hand, once the Agents start talking to each other
and the Manager over some standard messaging API/bus this problem might
well be solved for you; getting, say, Gossip or Paxos or any other
clustering/quorum protocol shouldn't be that hard considering the amount
of Java software already doing just that out there.

Gossip can be a possibility if it solves our challenge. We are looking
for the lowest common denominator that is also most accurate, hence we
propose "write check". Gossip or another solution, maybe a much larger

Well Gossip is just one idea that came to mind. The larger point (which as you note would
be a much larger undertaking) is that there's a bunch of projects, also under the Apache umbrella,
which CloudStack could leverage rather than trying to do this really tough stuff all over

Heck, considering the possible savings in developer load, CloudStack might even integrate
or start deeply leveraging stuff like Apache Mesos or Stratos.

Another idea would be to introduce some other kind of storage
monitoring, for example by a SystemVM or something.
If you insist on the 'clusters fence themselves' paradigm, you could
maybe also introduce a constraint that a node is only allowed to fence
others if it is itself healthy; ergo if it doesn't have all storages
available, it doesn't get to fence others whose storage isn't available.
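
That constraint can be stated as a small predicate. A rough sketch, assuming each host reports a mapping from storage pool name to reachability; the function name `may_fence` and the data shape are hypothetical, not anything CloudStack provides.

```python
def may_fence(fencer_storages, target_storages):
    """Decide whether one host may fence another.

    Each argument maps a storage pool name to True (reachable) or
    False (unreachable), as seen from that host.
    """
    fencer_healthy = all(fencer_storages.values())
    target_unhealthy = not all(target_storages.values())
    # Only a host that sees all of its storage may fence a host that does not.
    return fencer_healthy and target_unhealthy
```

In the failing-NFS scenario every host sees the same pool as unreachable, so `may_fence` is false for every pair and nobody resets anybody.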

Kinda what you already said above ;)
