cloudstack-dev mailing list archives

From Darren Shepherd
Subject HA is broken on master
Date Wed, 02 Oct 2013 22:36:30 GMT

In scheduleRestart(), the call to _itMgr.advanceStop() used to pass
the VO.  Now it passes a UUID.  So the VO the HA manager holds is out
of sync with the DB, and the recorded previous state and update count
are wrong, so in the worker HA will just stop the VM.
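
To make the mismatch concrete, here is a minimal sketch of how a held
VO can go stale when a stop is issued by UUID.  Everything in it is a
hypothetical stand-in (StaleVoSketch, VmRecord, FakeDb, stopByVo,
stopByUuid, snapshotMatchesDb); it is not CloudStack's actual code, and
it assumes that every state transition bumps the update count.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: these are NOT CloudStack's real classes or signatures.
class StaleVoSketch {
    enum State { RUNNING, STOPPED }

    // Stand-in for a VM value object (VO) backed by a vm_instance row.
    static final class VmRecord {
        final String uuid;
        State state;
        long updateCount;

        VmRecord(String uuid, State state, long updateCount) {
            this.uuid = uuid;
            this.state = state;
            this.updateCount = updateCount;
        }

        VmRecord copy() {
            return new VmRecord(uuid, state, updateCount);
        }
    }

    // Stand-in for the database; load() returns a detached copy, like a VO.
    static final class FakeDb {
        private final Map<String, VmRecord> rows = new HashMap<>();

        void insert(VmRecord record) {
            rows.put(record.uuid, record.copy());
        }

        VmRecord load(String uuid) {
            return rows.get(uuid).copy();
        }

        // Assumption: every state transition bumps the update count.
        void transition(String uuid, State next) {
            VmRecord row = rows.get(uuid);
            row.state = next;
            row.updateCount++;
        }
    }

    // Old style: the caller's VO is refreshed as part of the stop.
    static void stopByVo(FakeDb db, VmRecord vo) {
        db.transition(vo.uuid, State.STOPPED);
        VmRecord fresh = db.load(vo.uuid);
        vo.state = fresh.state;
        vo.updateCount = fresh.updateCount;
    }

    // New style: only the DB row changes; any VO the caller holds goes stale.
    static void stopByUuid(FakeDb db, String uuid) {
        db.transition(uuid, State.STOPPED);
    }

    // The kind of check an update-count scheme relies on.
    static boolean snapshotMatchesDb(FakeDb db, VmRecord held) {
        VmRecord current = db.load(held.uuid);
        return current.state == held.state
                && current.updateCount == held.updateCount;
    }

    public static void main(String[] args) {
        FakeDb db = new FakeDb();
        db.insert(new VmRecord("vm-1", State.RUNNING, 5));

        VmRecord held = db.load("vm-1");
        stopByUuid(db, held.uuid);
        System.out.println("in sync after stop by UUID: "
                + snapshotMatchesDb(db, held));
    }
}
```

In this model the held VO only stays in sync when the stop call is
given the object itself; after a stop by UUID the caller would have to
reload the row before recording the state and update count.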

I really think the update count approach is far too fragile.  For
example, currently if you try to start a VM and it fails, the update
count will change.  But the current code records the new update
count, so the next try will have the updated count.  I can see the
following issue, though maybe there's some workaround for it.  Imagine
you have a large failure and the stuff really hits the fan.  You have
thousands of HA jobs trying to run and things just keep failing.  To
stop the churn you shut down the mgmt stack to figure out what's up
with the infrastructure.  There's a really good chance that you would
kill the mgmt stack while a VM was in Starting.  Now the HA work
update count is out of sync with the current DB, so when you bring the
mgmt stack back up, it won't try to restart that VM.

Maybe that situation is taken care of somehow, but I could probably
dream up another one.  I think it is far simpler if, when a user
starts a VM, you record "should be running" in a new column of the
vm_instance table.  Then when the HA worker processes the record, it
will always see that the VM should be running.  If the user does a
stop, you clear that column.  This has the added benefit that when
things are bad and a user starts clicking restart/start, they won't
mess with the HA.  Maybe things have changed, but what I would see
before is that we'd have an issue where VMs should have been started
but weren't.  HA was trying, but it kept failing.  The user would log
in, see that their VM was down, and click start.  But that would fail
(similar to how HA was also failing).  So the VM would stay in
Stopped, but since they touched the VM, the update count changed and
HA wouldn't start it back up when the infra worked again.  So
customers who proactively tried to do something got penalized: their
downtime was longer because CloudStack wouldn't bring their VM back up
like the other VMs.
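
The difference between the two checks can be sketched as follows.  This
is an illustrative model only: ShouldBeRunningSketch, Vm,
shouldBeRunning, hostFailure and the two haWouldRestart* methods are
hypothetical names, not existing CloudStack code, and the sketch
assumes that any start attempt, even a failed one, changes the update
count.

```java
// Hypothetical model of the proposal; not CloudStack's actual code.
class ShouldBeRunningSketch {
    enum State { RUNNING, STOPPED }

    static final class Vm {
        State state = State.STOPPED;
        long updateCount = 0;
        boolean shouldBeRunning = false; // the proposed new column
    }

    // Assumption: any start attempt bumps the count, even if it fails.
    static void attemptStart(Vm vm, boolean infraHealthy) {
        vm.updateCount++;
        if (infraHealthy) {
            vm.state = State.RUNNING;
        }
    }

    // A user start records the intent before trying to start.
    static void userStart(Vm vm, boolean infraHealthy) {
        vm.shouldBeRunning = true;
        attemptStart(vm, infraHealthy);
    }

    // A user stop is the only thing that clears the intent.
    static void userStop(Vm vm) {
        vm.shouldBeRunning = false;
        vm.state = State.STOPPED;
        vm.updateCount++;
    }

    // An outage takes the VM down but leaves the user's intent untouched.
    static void hostFailure(Vm vm) {
        vm.state = State.STOPPED;
        vm.updateCount++;
    }

    // Update-count approach: act only if nothing touched the VM since
    // the HA work was recorded.
    static boolean haWouldRestartByCount(Vm vm, long recordedCount) {
        return vm.state != State.RUNNING && vm.updateCount == recordedCount;
    }

    // Flag approach: act whenever the VM is down but should be up.
    static boolean haWouldRestartByFlag(Vm vm) {
        return vm.state != State.RUNNING && vm.shouldBeRunning;
    }

    public static void main(String[] args) {
        Vm vm = new Vm();
        userStart(vm, true);            // VM comes up normally
        hostFailure(vm);                // outage: VM goes down
        long recorded = vm.updateCount; // HA work records the count

        userStart(vm, false);           // impatient user clicks start; it fails

        System.out.println("restart by count: "
                + haWouldRestartByCount(vm, recorded));
        System.out.println("restart by flag:  "
                + haWouldRestartByFlag(vm));
    }
}
```

In this model the failed user start changes the count, so the
count-based check gives up on the VM, while the flag-based check still
wants it running until the user explicitly stops it.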

