cloudstack-users mailing list archives

From Andrei Mikhailovsky <and...@arhont.com>
Subject Re: ALARM - ACS reboots host servers!!!
Date Mon, 03 Mar 2014 12:24:39 GMT

Nux, 


I am using HA for about 30% of the guest VMs, but my testing showed that HA does not work
reliably with KVM. It works pretty well if you shut down a VM from inside the guest without
using the ACS GUI. However, when the host goes down for whatever reason (power failure, init
6/0, network failure, etc.), HA fails to kick in and restart the VMs.


Regarding the NFS storage, I did not put the NFS server into maintenance mode. Would this
solve the problem with the reboots? I will try it the next time I do maintenance on the
NFS server, but I do recall restarting the NFS server in the past without seeing the hosts
reboot themselves. Is there a timeout that causes the hosts to reboot?



In any case, I think it is not safe to reboot host servers automatically, and if it were up
to me I would disable this feature in the agent. IMHO this should be left to the system
administrator, and the ACS agent should send an alert email if something goes wrong instead
of rebooting the host servers.


I am using Ceph as primary storage for the guest VMs' data and root disks. NFS is used
as a backup disk offering for the guests.


Andrei 


----- Original Message -----

From: "Nux!" <nux@li.nux.ro> 
To: users@cloudstack.apache.org 
Sent: Sunday, 2 March, 2014 10:24:07 PM 
Subject: Re: ALARM - ACS reboots host servers!!! 

On 02.03.2014 21:17, Andrei Mikhailovsky wrote: 
> Hello guys, 
> 
> 
> I recently came across the bug CLOUDSTACK-5429, which rebooted 
> all of my host servers without properly shutting down the guest VMs. 
> I simply upgraded and rebooted one of the NFS primary storage 
> servers and a few minutes later, to my horror, I found that all 
> of my host servers had been rebooted. Is it just me, or should 
> this bug be fixed ASAP and be a blocker for any new 
> ACS release? I mean, not only does it cause downtime, but also possible 
> data loss and server corruption. 

Hi Andrei, 

Do you have HA enabled and did you put that primary storage in 
maintenance mode before rebooting it? 
It's my understanding that ACS relies on the shared storage to perform 
HA, so if the storage goes away it's expected to go berserk. I've noticed 
similar behaviour in XenServer pools without ACS. 
I'd imagine a "cure" for this would be to use network distributed 
"filesystems" like GlusterFS or Ceph. 

Lucian 

-- 
Sent from the Delta quadrant using Borg technology! 

Nux! 
www.nux.ro 

