cloudstack-dev mailing list archives

From Wido den Hollander <w...@widodh.nl>
Subject Re: Disable HA temporary ?
Date Mon, 16 Feb 2015 15:05:52 GMT


On 16-02-15 13:16, Andrei Mikhailovsky wrote:
> I had similar issues at least two or three times. The host agent would
> disconnect from the management server and would not reconnect without
> manual intervention, yet it would happily keep running its VMs. The
> management server would then initiate HA and start VMs that were
> already running on the disconnected host. I ended up with a handful of
> VMs and virtual routers running on two hypervisors at once, which
> corrupted the disks and caused all sorts of issues.
> 
> I think there has to be a better way of dealing with this case, at
> least at the image level. Perhaps a host should keep some sort of lock
> file, or a file for every image in which it records a timestamp.
> Something like:
> 
> f5ffa8b0-d852-41c8-a386-6efb8241e2e7 and 
> f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp 
> 
> Here f5ffa8b0-d852-41c8-a386-6efb8241e2e7 is the name of the disk image
> and f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp is the image's
> timestamp file.
> 
> The hypervisor should write the timestamp to this file while the VM is
> running, say every 5-10 seconds. If the timestamp is old, we can assume
> that the volume is no longer used by the hypervisor.
> 
> When a VM is started, the timestamp file should be checked. If the
> timestamp is recent, the VM should not start; otherwise the VM should
> start and the timestamp file should be updated regularly.
> 
> I am sure there are better ways of doing this, but at least this method
> should not allow two VMs running on different hosts to use the same
> volume and corrupt the data.
> 
> In Ceph, as far as I remember, a new feature is being developed to
> provide a locking mechanism for RBD images. Not sure if this will do
> the job?
>

Something like what you propose is still on my wishlist for Ceph/RBD.

For NFS we currently have this in place, but for Ceph/RBD we don't. It's
a matter of code in the Agent and the investigators inside the
Management Server which decide if HA should kick in.
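
A minimal sketch of the timestamp-file idea quoted above, to make the
start-up check concrete. This is not CloudStack or Agent code: the
function names, the 30-second staleness threshold, and the use of a
plain file next to the image are all assumptions made for illustration.

```python
import os
import time

# Assumed threshold: a few missed 5-10 s updates before a volume counts as idle.
STALE_AFTER = 30.0


def heartbeat_path(image_path):
    """Timestamp file that sits next to the image: <image>-timestamp."""
    return image_path + "-timestamp"


def write_heartbeat(image_path, now=None):
    """Record the current time; meant to be called every 5-10 s while the VM runs."""
    now = time.time() if now is None else now
    path = heartbeat_path(image_path)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(repr(now))
        f.flush()
        os.fsync(f.fileno())
    # Rename into place so a reader never sees a half-written timestamp.
    os.replace(tmp, path)


def volume_in_use(image_path, now=None, stale_after=STALE_AFTER):
    """True if another host updated the timestamp within stale_after seconds."""
    now = time.time() if now is None else now
    try:
        with open(heartbeat_path(image_path)) as f:
            last = float(f.read().strip())
    except FileNotFoundError:
        # No timestamp file yet: nobody has claimed this volume.
        return False
    return (now - last) < stale_after


def may_start_vm(image_path, now=None):
    """Start-up check: refuse to start while the timestamp is still recent."""
    return not volume_in_use(image_path, now)
```

Two caveats with this sketch: the check and the start are not atomic, so
two hosts could still race, and comparing timestamps across hosts only
works if their clocks are kept in sync.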

Wido

> Andrei 
> 
> ----- Original Message -----
> 
>> From: "Wido den Hollander" <wido@widodh.nl>
>> To: dev@cloudstack.apache.org
>> Sent: Monday, 16 February, 2015 11:32:13 AM
>> Subject: Re: Disable HA temporary ?
> 
>> On 16-02-15 11:00, Andrija Panic wrote:
>>> Hi team,
>>>
>>> I just had some funny behaviour a few days ago: one of my hosts was
>>> under heavy load (some disk/network load) and it got disconnected
>>> from the MGMT server.
>>>
>>> The MGMT server then started doing its HA thing, but without being
>>> able to make sure that the VMs on the disconnected host were really
>>> shut down (and they were NOT).
>>>
>>> So MGMT started some of the VMs again on other hosts, resulting in 2
>>> copies of the same VM using shared storage, so corruption happened on
>>> the disk.
>>>
>>> Is there a way to temporarily disable the HA feature at the global
>>> level, or anything similar?
> 
>> Not that I'm aware of, but this is something I also ran into a couple
>> of times.
> 
>> It would indeed be nice if an admin had a way to stop the HA process
>> completely.
> 
>> Wido
> 
>>> Thanks
>>>
> 
