cloudstack-issues mailing list archives

From "Bjoern Teipel (JIRA)" <>
Subject [jira] [Commented] (CLOUDSTACK-5859) [HA] Shared storage failure results in reboot loop; VMs with Local storage brought offline
Date Thu, 27 Mar 2014 05:40:16 GMT


Bjoern Teipel commented on CLOUDSTACK-5859:

I personally don't see any reason to reboot a hypervisor when NFS is unavailable or timing
out due to I/O or network issues, especially if you have VMs on local or CLVM storage.
I'll patch our installation so that it does not reboot the hypervisor: I had a pool of 10 servers
happily rebooting after a VLAN configuration error, and they were also running CLVM with fencing on top.
It was not fun to fix. To my knowledge, this behavior doesn't exist on XenServer.

> [HA] Shared storage failure results in reboot loop; VMs with Local storage brought offline
> ------------------------------------------------------------------------------------------
>                 Key: CLOUDSTACK-5859
>                 URL:
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: KVM
>    Affects Versions: 4.2.0
>         Environment: RHEL/CentOS 6.4 with KVM
>            Reporter: Dave Garbus
>            Priority: Critical
> We have a group of 13 KVM servers added to a single cluster within CloudStack. All VMs
> use local hypervisor storage, with the exception of one that was configured to use NFS-based
> primary storage with a HA service offering.
>
> An issue occurred with the SAN responsible for serving the NFS mount (primary storage
> for HA VM) and the mount was put into a read-only state. Shortly after, each host in the cluster
> rebooted and continued to stay in a reboot loop until I put the primary storage into maintenance.
> These messages were in the agent.log on each of the KVM hosts:
>
> 2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout, retry: 4
> 2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout; reboot the host
>
> In essence, a single HA-enabled VM was able to bring down an entire KVM cluster that
> was hosting a number of VMs with local storage. It would seem that the fencing script needs
> to be improved to account for cases where both local and shared storage is used.
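
To make the suggestion in the description concrete, here is a rough sketch (in Python, for
illustration only) of a fencing decision that takes mixed local and shared storage into account.
It is not the CloudStack agent's code. The names Vm, Action, heartbeat_ok and decide_action, the
storage labels, and the retry count of 5 are all assumptions made for this example; the policy
itself (only reboot when no VM on the host would be hurt by it) is one possible choice, not
the project's.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Action(Enum):
    NONE = "none"      # heartbeat is fine, nothing to do
    ALERT = "alert"    # report the storage failure, keep the host up
    REBOOT = "reboot"  # fence the host by rebooting it


@dataclass(frozen=True)
class Vm:
    name: str
    storage: str       # hypothetical labels: "local", "clvm" or "nfs"
    ha_enabled: bool


def heartbeat_ok(write_heartbeat: Callable[[], bool], retries: int = 5) -> bool:
    """Try the heartbeat write up to `retries` times (the count is an assumption)."""
    for _ in range(retries):
        try:
            if write_heartbeat():
                return True
        except OSError:
            # Treat I/O errors and timeouts as a failed attempt and retry.
            pass
    return False


def decide_action(heartbeat_succeeded: bool, vms: Iterable[Vm]) -> Action:
    """Reboot only if every VM on the host lives on the failed shared storage."""
    if heartbeat_succeeded:
        return Action.NONE
    vms = list(vms)
    has_ha_on_shared = any(vm.storage == "nfs" and vm.ha_enabled for vm in vms)
    has_non_shared = any(vm.storage != "nfs" for vm in vms)
    if has_ha_on_shared and not has_non_shared:
        return Action.REBOOT
    # VMs on local or CLVM storage are unaffected by the NFS failure,
    # so rebooting would only take healthy guests offline.
    return Action.ALERT
```

With a policy like this, a host such as the ones in the report (many VMs on local storage, one
HA VM on NFS) would raise an alert instead of rebooting when the heartbeat write times out.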
