cloudstack-dev mailing list archives

From Koushik Das <koushik....@accelerite.com>
Subject Re: [DISCUSS][FS] Host HA for CloudStack
Date Tue, 21 Feb 2017 08:47:19 GMT
See inline.

Thanks,
Koushik

On 21/02/17, 11:47 AM, "Rohit Yadav" <rohit.yadav@shapeblue.com> wrote:

    Hi Koushik,
    
    
    Thanks for sharing your comments and questions.
    
    
1. Yes, the FS is divided into two parts: a general HA framework that makes no assumption
about the type of resource, and an HA provider implementation that works on a specific type of
resource (hypervisor, storage, etc.).

[Koushik] Hmm, the heading is misleading then. I would like to see the details of the generic
HA framework that you are proposing for any resource type. Which resource types can or need
to be HA’ed? I would also like to see a clear definition of “storage HA”, “network
HA”, “any resource HA”, etc. before going ahead with this generic framework. If this
new framework ends up doing only Host/VM HA, then there is no point in doing all this.

Specifically, with this feature we want to solve the problem of HA-ing a host reliably and
use the out-of-band management subsystem (i.e. IPMI-based status/reboot/power-off to investigate/recover/fence
the host) in the HA provider implementation. Yes, host HA should trigger VM HA, i.e. when
a host is fenced, its HA VMs are moved to other hosts. This also reliably solves the issue of disk
corruption when the same HA VMs get started on multiple hosts.

[Koushik] If host HA implies doing HA on all VMs running in a host, I am not clear as to why
host HA is needed separately when there is already VM HA available.
    
    2. The old VM HA implementation makes a lot of assumptions about the type of resource
(i.e. VM) it is HA-ing. It is tied to VM HA, which is why HA for hosts could not be added in
a straightforward way without regressions we could not test. The new HA framework
does not make any assumption about the type of resource, and it separates policy from mechanism.
We also want to add deterministic tests (using marvin tests and a simulator-based HA provider
implementation) to demonstrate the generic HA functionality. In future, with this framework,
HA for various resources such as VM, storage and network can be added. As a first step we want
to get the framework in, with support for Host as a resource type. We also want to reduce assumptions
and dependencies, as VM HA and Host HA are related (sequence etc.). The HAProvider interface
would be something every hypervisor can implement.
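
As a rough illustration of the policy/mechanism split described above, the sketch below keeps
the decision loop independent of the resource type and pushes the resource-specific work into
a provider. This is not code from the FS: only the name HAProvider comes from this thread, and
the method names, the SimulatorHAProvider class and the handle() function are hypothetical.

```python
from abc import ABC, abstractmethod


class HAProvider(ABC):
    """Mechanism: how to check, recover and fence one kind of resource.

    The method names are assumptions made for this sketch.
    """

    @abstractmethod
    def is_healthy(self, resource: str) -> bool: ...

    @abstractmethod
    def recover(self, resource: str) -> bool: ...

    @abstractmethod
    def fence(self, resource: str) -> bool: ...


class SimulatorHAProvider(HAProvider):
    """Hypothetical deterministic provider: outcomes are scripted up front."""

    def __init__(self, healthy: bool, recover_ok: bool, fence_ok: bool = True):
        self.healthy = healthy
        self.recover_ok = recover_ok
        self.fence_ok = fence_ok
        self.calls = []  # records every call so a test can assert on the sequence

    def is_healthy(self, resource):
        self.calls.append(("is_healthy", resource))
        return self.healthy

    def recover(self, resource):
        self.calls.append(("recover", resource))
        if self.recover_ok:
            self.healthy = True
        return self.recover_ok

    def fence(self, resource):
        self.calls.append(("fence", resource))
        return self.fence_ok


def handle(provider: HAProvider, resource: str, max_recovery_attempts: int = 1) -> str:
    """Policy: knows nothing about hosts, VMs or storage, only about the provider."""
    if provider.is_healthy(resource):
        return "available"
    for _ in range(max_recovery_attempts):
        # Try to recover first; fence only when recovery does not help.
        if provider.recover(resource) and provider.is_healthy(resource):
            return "recovered"
    return "fenced" if provider.fence(resource) else "fence-failed"
```

With a scripted provider, a test can assert both the outcome and the exact order of provider
calls, which is the kind of determinism the simulator-based tests are meant to give.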

[Koushik] Again, please justify why host HA is needed when VM HA is already there. If the question
is about ease of writing automated tests, I have already written simulator-based tests for
the existing VM HA. Please refer to https://cwiki.apache.org/confluence/display/CLOUDSTACK/Writing+tests+leveraging+the+simulator+enhancements
for the test details.
    
    3. While an existing (VM) HA framework exists, it was safer to write new code and demonstrate
that it works for any general HA resource than to refactor and implement this in the old framework,
which could introduce serious regressions leading to production issues. For the most part,
we've avoided altering anything in the old HA framework while making sure that the old (VM) HA
works well with the new HA framework. The JIRA issue for the feature is in the FS.

[Koushik] As mentioned in a previous comment, please define which resources need to be
HA’d and why. For example, there is RVR, which provides HA for the network services
provided by the VR. Also, other network plugins may have native ways of achieving HA and
may not need anything from the CS perspective. I want to make sure that all these points are
accounted for before we proceed with a generic framework.
    
    
    4. Any HA operation can be blocking in nature. Because out-of-band operations can take time,
the feature includes a background polling manager that polls for changes and a task/activity
executor. All the health/activity/fencing/recovery operations therefore have timeouts,
limits and specific queues. The existing framework does not provide any abstraction to queue
operations, restrict operation timeouts, or tie them to an FSM. The existing framework is also hard
to test, specifically to validate using integration tests. We also wanted to avoid adding any
regressions to the existing/old VM HA. Lastly, the primary use of IPMI/out-of-band management
in performing host HA is not investigation but recovery (try a reboot) and fencing
(power off).
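
To make the idea of timeout-bounded operations tied to an FSM more concrete, here is a minimal
sketch. It is an illustration under assumptions, not the design in the FS: the state names,
the event names, the TRANSITIONS table and the run_with_timeout() helper are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from enum import Enum, auto
import time


class State(Enum):
    # Hypothetical HA states for one resource.
    AVAILABLE = auto()
    SUSPECT = auto()
    RECOVERING = auto()
    FENCING = auto()
    FENCED = auto()


# Allowed (state, event) -> next state pairs; anything else is rejected.
TRANSITIONS = {
    (State.AVAILABLE, "health_check_failed"): State.SUSPECT,
    (State.SUSPECT, "health_check_passed"): State.AVAILABLE,
    (State.SUSPECT, "activity_check_failed"): State.RECOVERING,
    (State.RECOVERING, "recovered"): State.AVAILABLE,
    (State.RECOVERING, "recovery_failed"): State.FENCING,
    (State.FENCING, "fenced"): State.FENCED,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, or raise if the FSM does not allow the event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event}") from None


def run_with_timeout(executor: ThreadPoolExecutor, task, timeout_s: float) -> bool:
    """Run a possibly blocking task on the executor and wait at most timeout_s.

    A timeout is treated as failure. Note that this only stops waiting: the
    task itself keeps running in its worker thread until it returns.
    Exceptions raised by the task propagate to the caller.
    """
    future = executor.submit(task)
    try:
        return bool(future.result(timeout=timeout_s))
    except FutureTimeout:
        return False
```

A caller would map the boolean result to an event, for example "recovered" or
"recovery_failed", and feed it to next_state(), so a slow out-of-band reboot cannot stall
the state machine indefinitely.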

[Koushik] A lot of the points you have raised here are not correct. There is already polling of
all the hosts to find out VM state changes, and there are queues and time-outs in place to send commands to
hypervisors, etc. Have you evaluated the option of using IPMI in the existing KVM HA plugins?

 
    
    Hope this answers your questions. Please feel free to add more comments and questions. Thanks.
    
    
    Regards.
    
    
    ________________________________
    From: Koushik Das <koushik.das@accelerite.com>
    Sent: 20 February 2017 11:45
    To: dev@cloudstack.apache.org
    Subject: Re: [DISCUSS][FS] Host HA for CloudStack
    
    Rohit,
    
    Thanks for the effort you have put in writing the FS. I have some questions based on my
initial reading of the FS.
    
    1. “Host HA” – In the FS you are talking about a generic HA framework, but it is
not clear what “host HA” means. Is it that all or some VMs running on
a host will be started on other host(s) in case of a failure, or is it something else? How
is it different from the “VM HA” that already exists?
    2. You have mentioned that “Cloudstack lacks a way to reliably fence host”. Cloudstack
considers the VM a first-class object and so provides fencing for VMs instead of hosts. There are
hypervisor-specific plugins that implement a mechanism to fence a VM. I am not sure if it makes
sense to expose host fencing, as the end user doesn’t care about it. The VM fencing implementation
can use something like “host fencing” internally.
    3. There is an existing HA framework which provides plugins for investigating whether
a VM is alive or not, whether a host is alive or not, and for fencing a VM in case it is not alive. It will
be good to understand the limitations of the existing framework and how the new framework
helps in solving these problems. We also need to understand whether the limitation is in the framework
or in some specific plugin implementation that is causing issues. References to JIRA issues would
help.
    4. You have mentioned using IPMI to investigate host failure. I would like to understand
why the same can’t be used in the existing framework.
    
    Thanks,
    Koushik
    
    On 16/02/17, 4:48 PM, "Rohit Yadav" <rohit.yadav@shapeblue.com> wrote:
    
        All,
    
    
        I would like to start a discussion on a new feature - Host HA for CloudStack.
    
        CloudStack lacks a way to reliably fence a host. The idea of the host-ha feature is
to provide a general-purpose HA framework and a hypervisor-specific HA provider implementation
that can use additional mechanisms such as OOBM (IPMI-based power management) to reliably investigate,
recover and fence a host. This feature can handle scenarios associated with server crash issues,
reliable fencing of hosts and HA of VMs. The first version will have an HA provider implementation
for KVM (and one for the simulator, to test the framework implementation and to write marvin tests that
can validate the feature on Travis and others).
    
    
        Please have a look at the FS here:
    
        https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
    
    
        Looking forward to your comments and questions.
    
    
        Regards.
    
        rohit.yadav@shapeblue.com
        www.shapeblue.com
        53 Chandos Place, Covent Garden, London  WC2N 4HS UK
        @shapeblue
    
    
    
    
    
    
    
    
    DISCLAIMER
    ==========
    This e-mail may contain privileged and confidential information which is the property
of Accelerite, a Persistent Systems business. It is intended only for the use of the individual
or entity to which it is addressed. If you are not the intended recipient, you are not authorized
to read, retain, copy, print, distribute or use this message. If you have received this communication
in error, please notify the sender and delete all copies of this message. Accelerite, a Persistent
Systems business does not accept any liability for virus infected mails.
    
      
     
    
    



