cloudstack-dev mailing list archives

From Andrei Mikhailovsky <>
Subject Re: Disable HA temporary ?
Date Mon, 16 Feb 2015 12:16:09 GMT
I have had similar issues at least two or three times. The host agent would disconnect from the
management server and would not reconnect without manual intervention, yet the host would
happily continue running its vms. The management server would then initiate HA and start vms
that were already running on the disconnected host. I ended up with a handful of vms and
virtual routers running on two hypervisors at once, which corrupted their disks and caused
all sorts of issues.

I think there has to be a better way of dealing with this case. At least on an image level.
Perhaps a host should keep some sort of lock file or a file for every image where it would
record a time stamp. Something like: 

f5ffa8b0-d852-41c8-a386-6efb8241e2e7 and f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp

Here, f5ffa8b0-d852-41c8-a386-6efb8241e2e7 is the name of the disk image and
f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp is the file holding the image's time stamp.

The hypervisor should record the time stamp in this file while the vm is running. Let's say
every 5-10 seconds. If the timestamp is old, we can assume that the volume is no longer used
by the hypervisor. 

When a vm is started, the timestamp file should be checked. If the timestamp is recent,
the vm should not start; otherwise, the vm should start and the timestamp file should be regularly
updated.

I am sure there are better ways of doing this, but at least this method should not allow two
vms running on different hosts to use the same volume and corrupt the data. 
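
A minimal sketch of the timestamp check described above, in Python. Only the "-timestamp" file naming and the 5-10 second update interval come from the proposal; the function names and the 30 second staleness threshold are hypothetical, and nothing here is existing CloudStack or hypervisor code. It also assumes that all hosts see the same shared storage and have synchronised clocks, and it does not close the race where two hosts check the file at the same moment.

```python
import os
import time

# Assumed threshold: a few missed 5-10 s updates. Not a value from the proposal.
STALE_AFTER_SECONDS = 30.0


def timestamp_path(image_path):
    """Return the timestamp file path for a disk image (<image>-timestamp)."""
    return image_path + "-timestamp"


def write_heartbeat(image_path, now=None):
    """Record a time stamp next to the image; meant to be called every 5-10 s
    by the host that is running the vm."""
    now = time.time() if now is None else now
    tmp = timestamp_path(image_path) + ".tmp"
    with open(tmp, "w") as f:
        f.write(repr(now))
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old file so readers never see a half-written time stamp.
    os.replace(tmp, timestamp_path(image_path))


def read_heartbeat(image_path):
    """Return the last recorded time stamp, or None if it is missing or unreadable."""
    try:
        with open(timestamp_path(image_path)) as f:
            return float(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def may_start_vm(image_path, now=None, stale_after=STALE_AFTER_SECONDS):
    """Allow a start only if there is no time stamp or the time stamp is old."""
    now = time.time() if now is None else now
    last = read_heartbeat(image_path)
    return last is None or (now - last) > stale_after
```

A host would call may_start_vm() before starting a vm and then call write_heartbeat() in a loop for as long as the vm runs. Treating an unreadable time stamp as "not in use" is a choice made for brevity; refusing to start in that case would be the more cautious option.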

In ceph, as far as I remember, a new feature is being developed to provide a locking mechanism
for rbd images. I am not sure whether it will do the job.


----- Original Message -----

> From: "Wido den Hollander" <>
> To:
> Sent: Monday, 16 February, 2015 11:32:13 AM
> Subject: Re: Disable HA temporary ?

> On 16-02-15 11:00, Andrija Panic wrote:
> > Hi team,
> >
> > I just had funny behaviour a few days ago - one of my hosts was under heavy
> > load (some disk/network load) and it got disconnected from the MGMT server.
> >
> > Then the MGMT server started doing the HA thing, but without being able to make sure
> > that the VMs on the disconnected host were really shut down (and they were NOT).
> >
> > So MGMT again started some VMs on other hosts, resulting in 2 copies of the
> > same VM using shared storage - so corruption happened on the disk.
> >
> > Is there a way to temporarily disable the HA feature at the global level, or
> > anything similar?

> Not that I'm aware of, but this is something I also ran into a couple of times.

> It would indeed be nice if there could be a way to stop the HA process
> completely as an Admin.

> Wido

> > Thanks
> >
