Subject: Re: [DISCUSS] KVM HA with IPMI Fencing
From: Ronald van Zantvoort
Organization: PCextreme
To: dev@cloudstack.apache.org
Date: Tue, 20 Oct 2015 13:21:55 +0200

On 19/10/15 23:10, ilya wrote:
> Ronald,
>
> Please see response in-line...

And you too :)

> On 10/19/15 2:18 AM, Ronald van Zantvoort wrote:
>> On 16/10/15 00:21, ilya wrote:
>>> I noticed several attempts to address the issue with KVM HA in Jira
>>> and the dev ML. As we all know, there are many ways to solve the
>>> same problem; on our side, we've given it some thought as well - and
>>> it's on our to-do list.
>>>
>>> Specifically, the mail thread "KVM HA is broken, let's fix it"
>>> JIRA: https://issues.apache.org/jira/browse/CLOUDSTACK-8943
>>> JIRA: https://issues.apache.org/jira/browse/CLOUDSTACK-8643
>>>
>>> We propose the following solution that in our understanding should
>>> cover all use cases and provide a fencing mechanism.
>>>
>>> NOTE: The proposed IPMI fencing is just a script. If you are using
>>> HP hardware with ILO, it could be an ILO executable with specific
>>> parameters. In theory this can be *any* action script, not just
>>> IPMI.
>>>
>>> Please take a few minutes to read this through, to avoid duplicate
>>> efforts...
>>>
>>> Proposed FS below:
>>> ----------------
>>>
>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/KVM+HA+with+IPMI+Fencing
>>
>> Hi Ilya, thanks for the design; I've put a comment in 8943, here it
>> is verbatim as my 5c in the discussion:
>>
>> ilya musayev: Thanks for the design document. I can't comment in
>> Confluence, so here goes:
>> When to fence; Simon Weller: Of course you're right that it should
>> be highly unlikely that your storage completely disappears from the
>> cluster. Be that as it may, as you yourself note, first of all, if
>> you're using NFS without HA that likelihood increases manyfold.
>> Secondly, defining it as an unlikely disastrous event seems no
>> reason not to take it into account; making it a catastrophic event
>> by 'fencing' all affected hypervisors will not serve anyone, as it
>> would be unexpected and unwelcome.
>>
>> The entire concept of fencing exists to absolutely ensure state,
>> specifically in this regard the state of the block devices and their
>> data. Marcus Sorensen: For that same reason it's not reasonable to
>> 'just assume' the VMs are gone. There are a ton of failure domains
>> that could cause an agent to disconnect from the manager but still
>> have the same VMs running, and there's nothing stopping CloudStack
>> from starting the same VM twice on the same block devices, with
>> disastrous results. That's why you need to know the VMs are very
>> definitely not running anymore, which is exactly what fencing is
>> supposed to do.
>>
>> For this, IPMI fencing is a nice and very often used option,
>> absolutely ensuring a hypervisor has died, and ergo the VMs running
>> on it. It will however not fix the case of the mass-rebooting
>> hypervisors (but rather quite likely make it even more of an
>> adventure if not addressed properly).
>> Now, with all that in mind, I'd like to make the following comments
>> regarding ilya musayev's design.
>>
>> First off, the IPMI implementation: there is IMHO no need to define
>> IPMI (Executable, Start, Stop, Reboot, Blink, Test). IPMI is a
>> protocol, and all of these are standard commands. For example, using
>> the venerable `ipmitool` gives you `chassis power
>> (on,status,poweroff,identify,reset)` etc., which will work on any
>> IPMI device; only authentication details (User, Pass, Proto) differ.
>> There's bound to be some library that does it without having to
>> resort to (possibly numerous) different (versions of) external
>> binaries.
> I am well aware of this - however, I want this to be flexible. There
> may be hardware that does not conform to the OpenIPMI standard (yet).
> Perhaps I want to use a wrapper script instead that may do a few more
> actions besides IPMI. While initially we intended for this to be
> IPMI, I don't want to limit it to just IPMI. Hence, some flexibility
> would not hurt. It can be IPMI - or anything else, but I want this to
> be flexible.

Something like STONITH comes to mind; this venerable Linux-HA project
has been the fencing cornerstone of Pacemaker/Corosync clusters for
years. It contains a ton of standardized fencing mechanisms and,
although I haven't used it in quite a while, it should be generically
employable. You'd simply provide a configuration interface for maybe
even multiple STONITH fencing mechanisms without having to reinvent
all these wheels yourself.
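To make the 'any action script' idea concrete, here's a rough sketch
of what such a pluggable fence hook could look like. Everything in it
(the argument order, the 'fence'/'test' actions, the exit-code
contract) is made up purely for illustration; it just shells out to
the standard ipmitool chassis commands mentioned above:

#!/usr/bin/env python3
# Hypothetical fence hook: the management server would only know "run
# this executable with these arguments"; whether it speaks IPMI, ILO
# or something else entirely is the operator's business.
import subprocess
import sys

def ipmi(host, user, password, *args):
    # Plain ipmitool over lanplus; 'chassis power status/off/on/reset'
    # are standard IPMI commands that work on any compliant BMC.
    cmd = ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", user, "-P", password] + list(args)
    return subprocess.run(cmd, capture_output=True, text=True, timeout=30)

def fence(host, user, password):
    ipmi(host, user, password, "chassis", "power", "off")
    # Only claim success if the BMC confirms the node is actually down.
    status = ipmi(host, user, password, "chassis", "power", "status")
    return "off" in status.stdout.lower()

if __name__ == "__main__":
    host, user, password, action = sys.argv[1:5]
    if action == "fence":
        sys.exit(0 if fence(host, user, password) else 1)
    else:  # "test": can we reach the BMC with these credentials at all?
        sys.exit(ipmi(host, user, password, "chassis", "power",
                      "status").returncode)

A STONITH agent, an ILO binary or some vendor CLI could be dropped in
behind the same tiny contract without CloudStack having to care.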
>> Secondly, you're assuming that hypervisors can access the IPMIs of
>> their cluster/pod peers; although I'm not against this assumption
>> per se, I'm also not convinced we're servicing everybody by forcing
>> that assumption to be true; some kind of IPMI agent/proxy comes to
>> mind, or even relegating the task back to the manager or some
>> SystemVM. Also bear in mind that you need access to those IPMIs to
>> ensure cluster functionality, so a failure domain should be in
>> maintenance state if any of the fence devices can't be reached.
> Good point, perhaps a test needs to be built in to confirm that ACS
> can reach its target with proper credentials.

>> Thirdly, your proposed testing algorithm needs more discussion;
>> after all, it directly hits the fundamental reasons for fencing a
>> host, and that's a lot more than just 'these disks still get
>> writes'.
> Agree, hence this thread :)

>> In fact, by the time you're checking this, you're probably already
>> assuming something's very wrong with the hypervisor, so why not just
>> fence it then?
> Because it's not that simple with the way CloudStack works now. There
> are many corner cases we need to consider before we pull the plug.
> There could be numerous issues - some temporary environmental
> problems. A few things come to mind now:
> 1) The agent can die
> 2) A communication issue between the MS and the KVM host

>> The decision to fence should lie with the first notification that
>> something is (very) wrong with the hypervisor, and only limited
>> attempts should be made to get it out of that state. Say it can't
>> reach its storage and that gets you your HA actions; why check for
>> the disks first?
> Because we realize that the existing HA framework is not very robust
> and also "incomplete". We can write many test cases and follow them
> one by one, but if fundamentally your VM is up and running (writing
> to disk) and no other hypervisor owns this VM, your hypervisor in
> question is functional. We would like to stay on the cautious side:
> even if a single VM is up on the hypervisor, it won't be killed.

>> Try to get the storage back up like 3 times, or for 90 sec or so,
>> then fence the fucker and HA the VMs immediately after confirmation.
>> In fact, that's exactly what it's doing now, with the side note that
>> confirmation can only reasonably follow after the hypervisor is done
>> rebooting.
> Both the check interval and the sleep in between should be
> configurable. If you don't want to wait 30 seconds and do 3 tests, do
> just 1 test and set it to 1 second; it's up to the end user what he
> wants it to be. We are providing the framework to achieve this, and
> the end user gets to plug in what he feels is right.

My remarks were more workflow-related: I'd propose something like
--> --> --> --> -->
Your proposal to me read something like
--> --> --> --> --> -->
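Or, to put the configurable part of that in rough code (the check
count, the interval and the actual check/fence/HA calls are all
placeholders here, not anything that exists in CloudStack today):

import time

def handle_suspect_host(host, check_alive, fence, start_ha,
                        checks=3, interval=30):
    # check_alive/fence/start_ha are injected callables standing in for
    # whatever the agent/manager would really do (the proposed disk
    # "write check", the IPMI hook, the HA kick-off).
    for _ in range(checks):
        if check_alive(host):       # any sign of life -> hands off
            return "healthy"
        time.sleep(interval)        # both knobs operator-configurable
    if not fence(host):             # fence FIRST, and demand confirmation;
        return "fence-failed"       # if it fails: stop, notify, do NOT HA
    start_ha(host)                  # only now is it safe to restart the VMs
    return "fenced"

The important bit to me is the ordering: no HA action ever starts
before the fence is positively confirmed, however the knobs are tuned.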
>> Finally, as mentioned, you're not solving the 'oh look, my storage
>> is gone, let's fence' * (N) problem; in the case of a failing NFS:
>> every host will start IPMI-resetting every other hypervisor; by then
>> there's a good chance every hypervisor in all connected clusters is
>> rebooting, leaving a state where there are no hypervisors in the
>> cluster to fence others; that in turn should lead to the cluster
>> falling into maintenance state, which will lead to even more bells &
>> whistles going off.
>> They'll come back, find the NFS still gone, and continue resetting
>> each other like there's no tomorrow.
> This is a doomsday scenario that should not happen (based on the
> proposed implementation).
> 1) This feature should apply only to a "Shared" storage setup (not
> mixed - local and shared)

Whenever 'VMs with the HA flag' + 'VMs can HA' is involved, this
mechanism should be in place? Even local-storage VMs can find
themselves in dire straits with a need to fence the entire hypervisor
to get them running again.

> 2) If you lose the storage underneath as you propose (the entire
> array), all hosts in the cluster will be down. Therefore, no
> hypervisors will be qualified to do a neighbor test. The feature we
> propose would not go on a shooting spree - because there will be no
> hypervisor that is in "Up" state.
> In addition, if something like that were to occur, another safety
> check should be there. Quoting the Confluence page:
> "Failure actions must be capped to avoid freak issue shutdowns, for
> example allow no more than 3 power downs within 1 hour -> both
> options configurable, if this condition is met - stop and notify"

Yeah, kinda what I said in the 'clusters fence themselves' paradigm.
That's one of the core things that absolutely needs to be got right
IMHO. I might be biased because I've had situations like this happen
to me in a number of different sizable clusters, which was really bad
for my life expectancy ;)

>> Support staff already panicking over the NFS/network outage now have
>> to deal with entire clusters of hypervisors in perpetual reboot, as
>> well as clusters which are completely unreachable because there's no
>> one left to check state; all this while the outage might simply
>> require the revert of some inadvertent network ACL snafu.
> In the current implementation, CloudStack agents will commit suicide;
> we would like for this to change as well and make it user
> configurable.

I'd be rather careful with that. We need to find the right balance
between configurables (timeouts, retries, fencing mechanisms &
authorities) and irresponsible users (no fencing 'cauz' that's
bothersome, but try HA nonetheless so my VMs are back up quickly, then
start complaining loudly about CloudStack when a small number of 'em
have borked up their data).

>> Although I well understand Simon Weller's concerns regarding agent
>> complexity in this regard, quorum is the standard way of solving
>> that problem. On the other hand, once the Agents start talking to
>> each other and the Manager over some standard messaging API/bus,
>> this problem might well be solved for you; getting, say, Gossip or
>> Paxos or any other clustering/quorum protocol shouldn't be that hard
>> considering the amount of Java software already doing just that out
>> there.
> Gossip can be a possibility if it solves our challenge. We are
> looking for the lowest common denominator that is also most accurate,
> hence we propose the "write check". Gossip, or another solution, may
> be a much larger undertaking.

Well, Gossip is just one idea that came to mind. The larger point
(which, as you note, would be a much larger undertaking) is that
there's a bunch of projects, also under the Apache umbrella, which
CloudStack could leverage rather than trying to do this really tough
stuff all over again. Heck, considering the possible savings in
developer load, CloudStack might even integrate or start deeply
leveraging stuff like Apache Mesos or Stratos.
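Just to make the quorum point concrete: this is only the decision
rule, not a Gossip/Paxos implementation, and peer_sees_host_down is a
placeholder for however the agents would actually ask each other
(Gossip, the management server, some message bus):

def may_fence(target, peers, peer_sees_host_down):
    # Never let a single host decide on its own: require a strict
    # majority of the surviving peers to agree the target is gone
    # before anyone is allowed to pull the IPMI trigger.
    if len(peers) < 2:
        return False
    votes = sum(1 for p in peers if peer_sees_host_down(p, target))
    return votes * 2 > len(peers)

That one rule is what keeps a dead NFS export from turning into the
cluster-wide shooting spree described above.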
>> Another idea would be to introduce some other kind of storage
>> monitoring, for example by a SystemVM or something.
>> If you'll insist on the 'clusters fence themselves' paradigm, you
>> could maybe also introduce a constraint that a node is only allowed
>> to fence others if itself is healthy; ergo if it doesn't have all
>> storages available, it doesn't get to fence others whose storage
>> isn't available.
>> Kinda what you already said above ;)
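For what it's worth, that constraint plus the capped-actions safety
net quoted from the Confluence page could be as small as this (numbers
and names are just illustrative defaults, all of it configurable):

import time

class FenceGate:
    def __init__(self, max_fences=3, window=3600):
        # "no more than 3 power downs within 1 hour", both configurable
        self.max_fences, self.window = max_fences, window
        self.recent = []                  # timestamps of our fence actions

    def allowed(self, my_storages_ok):
        if not my_storages_ok:            # my own storage is gone too:
            return False                  # I don't get to shoot anyone
        now = time.time()
        self.recent = [t for t in self.recent if now - t < self.window]
        if len(self.recent) >= self.max_fences:
            return False                  # cap hit -> stop and notify instead
        self.recent.append(now)
        return True

Every fence attempt would have to pass through something like this
gate first; if it says no, the only remaining action is to alert a
human.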