cloudstack-issues mailing list archives

From "Dave Garbus (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CLOUDSTACK-5859) [HA] Shared storage failure results in reboot loop; VMs with Local storage brought offline
Date Sun, 12 Jan 2014 20:55:51 GMT

     [ https://issues.apache.org/jira/browse/CLOUDSTACK-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Garbus updated CLOUDSTACK-5859:
------------------------------------

    Description: 
We have a group of 13 KVM servers added to a single cluster within CloudStack. All VMs use
local hypervisor storage, with the exception of one that was configured to use NFS-based primary
storage with an HA service offering.

An issue occurred with the disk responsible for serving the NFS mount (primary storage for
HA VM) and the mount was put into a read-only state. Shortly after, each host in the cluster
rebooted and continued to stay in a reboot loop until I put the primary storage into maintenance.
These messages were in the agent.log on each of the KVM hosts:

2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout, retry: 4
2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout; reboot the host

In essence, a single HA-enabled VM was able to bring down an entire KVM cluster that was hosting
a number of VMs with local storage. It would seem that the fencing script needs to be improved
to account for cases where both local and shared storage are used.

  was:
We have a group of 13 KVM servers added to a single cluster within CloudStack. All VMs use
local hypervisor storage, with the exception of one that was configured to use NFS-based primary
storage with an HA service offering.

An issue occurred with the disk responsible for serving the NFS mount and the mount was put
into a read-only state. Shortly after, each host in the cluster rebooted and continued to
stay in a reboot loop until I put the primary storage into maintenance. These messages were
in the agent.log on each of the KVM hosts:

2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout, retry: 4
2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout; reboot the host

In essence, a single HA-enabled VM was able to bring down an entire KVM cluster that was hosting
a number of VMs with local storage. It would seem that the fencing script needs to be improved
to account for cases where both local and shared storage are used.


> [HA] Shared storage failure results in reboot loop; VMs with Local storage brought offline
> ------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-5859
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-5859
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: KVM
>    Affects Versions: 4.2.0
>         Environment: RHEL/CentOS 6.4 with KVM
>            Reporter: Dave Garbus
>            Priority: Critical
>
> We have a group of 13 KVM servers added to a single cluster within CloudStack. All VMs
> use local hypervisor storage, with the exception of one that was configured to use NFS-based
> primary storage with an HA service offering.
> An issue occurred with the disk responsible for serving the NFS mount (primary storage
> for HA VM) and the mount was put into a read-only state. Shortly after, each host in the cluster
> rebooted and continued to stay in a reboot loop until I put the primary storage into maintenance.
> These messages were in the agent.log on each of the KVM hosts:
> 2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout, retry: 4
> 2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout; reboot the host
> In essence, a single HA-enabled VM was able to bring down an entire KVM cluster that
> was hosting a number of VMs with local storage. It would seem that the fencing script needs
> to be improved to account for cases where both local and shared storage are used.
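
To make the suggested improvement concrete, here is a minimal sketch of a fencing
decision that takes local storage into account. It is not CloudStack's actual
KVMHAMonitor or fencing script. Every name in it (Vm, write_heartbeat_with_retry,
should_fence_host, on_heartbeat_cycle) is hypothetical, and the retry count of 5 is an
assumption based only on "retry: 4" appearing in the log above. The idea is that a host
reboots itself only when an HA-enabled VM it runs depends on the pool whose heartbeat
failed; otherwise it raises an alert and keeps its local-storage VMs running.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Vm:
    """Hypothetical view of a VM running on this host."""
    name: str
    ha_enabled: bool
    pool_id: str  # storage pool backing the VM's volumes


def write_heartbeat_with_retry(write: Callable[[], bool], max_retries: int = 5) -> bool:
    """Call write() up to max_retries times; True as soon as one call succeeds."""
    for _ in range(max_retries):
        if write():
            return True
    return False


def should_fence_host(failed_pool_id: str, vms_on_host: Iterable[Vm]) -> bool:
    """Fence only if some HA-enabled VM on this host uses the failed pool."""
    return any(vm.ha_enabled and vm.pool_id == failed_pool_id for vm in vms_on_host)


def on_heartbeat_cycle(pool_id: str, write: Callable[[], bool], vms_on_host: Iterable[Vm]) -> str:
    """Return the action for one heartbeat cycle: 'ok', 'reboot', or 'alert-only'."""
    if write_heartbeat_with_retry(write):
        return "ok"
    if should_fence_host(pool_id, vms_on_host):
        return "reboot"
    # No HA VM on this host depends on the failed pool: leave local-storage VMs alone.
    return "alert-only"
```

Under this sketch, the 12 hosts running only local-storage VMs would have returned
"alert-only" when the NFS heartbeat timed out, and only the host running the HA VM
would have been rebooted.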



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
