cloudstack-issues mailing list archives

From "haijiao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLOUDSTACK-5859) [HA] Shared storage failure results in reboot loop; VMs with Local storage brought offline
Date Thu, 09 Apr 2015 08:44:12 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486983#comment-14486983 ]

haijiao commented on CLOUDSTACK-5859:
-------------------------------------

We hit a similar issue too.

KVM VMs configured as 'HA' within one cluster are able to access two NFS primary storages (1# and 2#).

When storage 2# accidentally became inaccessible (due to an incorrect permission setting), all the hosts within that cluster kept rebooting with the messages below until we corrected the setting.

It seems the design here could be further improved: CloudStack should check whether any other storage attached to these VMs is still accessible. The script 'kvmheartbeat.sh' should NOT reboot the host as long as at least one shared storage is still working, since the root cause is clearly not the host but something else.
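
As a rough illustration of that suggestion, here is a minimal Java sketch (the CloudStack agent is written in Java) of a fencing policy that reboots only when EVERY shared pool fails its heartbeat. The PoolHeartbeat interface and shouldFenceHost helper are hypothetical names for illustration, not the actual KVMHAMonitor API:

    import java.util.List;

    public class HeartbeatPolicySketch {

        // Placeholder for one primary storage pool's heartbeat probe,
        // e.g. a wrapper around kvmheartbeat.sh for that pool.
        interface PoolHeartbeat {
            String uuid();
            boolean writeHeartbeat();
        }

        static final int MAX_RETRIES = 5; // matches "retry: 0".."retry: 4" in the log

        // Fence (reboot) only when no pool at all accepts the heartbeat,
        // i.e. the host itself is the likely culprit.
        static boolean shouldFenceHost(List<PoolHeartbeat> pools) {
            boolean anyPoolAlive = false;
            for (PoolHeartbeat pool : pools) {
                boolean ok = false;
                for (int retry = 0; retry < MAX_RETRIES && !ok; retry++) {
                    ok = pool.writeHeartbeat();
                }
                if (ok) {
                    anyPoolAlive = true; // at least one shared storage still works
                } else {
                    System.err.println("write heartbeat failed for pool " + pool.uuid());
                }
            }
            return !anyPoolAlive;
        }
    }

With a policy like this, the 1#/2# scenario above would log the failure on 2# but leave the hosts up, because 1# still answers.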

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2015-04-03 14:01:29,555 WARN [kvm.resource.KVMHAMonitor] (Thread-1330:null) write heartbeat failed: Failed to create /mnt/5e41f790-8da9-36d2-938e-f3ea767bfadb/KVMHA//hb-10.226.31.11, retry: 0
2015-04-03 14:01:29,575 WARN [kvm.resource.KVMHAMonitor] (Thread-1330:null) write heartbeat failed: Failed to create /mnt/5e41f790-8da9-36d2-938e-f3ea767bfadb/KVMHA//hb-10.226.31.11, retry: 1
2015-04-03 14:01:29,595 WARN [kvm.resource.KVMHAMonitor] (Thread-1330:null) write heartbeat failed: Failed to create /mnt/5e41f790-8da9-36d2-938e-f3ea767bfadb/KVMHA//hb-10.226.31.11, retry: 2
2015-04-03 14:01:29,614 WARN [kvm.resource.KVMHAMonitor] (Thread-1330:null) write heartbeat failed: Failed to create /mnt/5e41f790-8da9-36d2-938e-f3ea767bfadb/KVMHA//hb-10.226.31.11, retry: 3
2015-04-03 14:01:29,635 WARN [kvm.resource.KVMHAMonitor] (Thread-1330:null) write heartbeat failed: Failed to create /mnt/5e41f790-8da9-36d2-938e-f3ea767bfadb/KVMHA//hb-10.226.31.11, retry: 4
2015-04-03 14:01:29,635 WARN [kvm.resource.KVMHAMonitor] (Thread-1330:null) write heartbeat failed: Failed to create /mnt/5e41f790-8da9-36d2-938e-f3ea767bfadb/KVMHA//hb-10.226.31.11; reboot the host
2015-04-03 14:02:01,246 INFO [cloud.agent.Agent] (AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
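
For context on what the failing operation is: judging from the path in the log, the heartbeat is simply a file named hb-<host-ip> written under the pool's KVMHA directory on the NFS mount. A self-contained Java sketch of what such a write amounts to (file name and layout taken from the log above; writeHeartbeat is a hypothetical helper, not agent code):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class HeartbeatWriteSketch {
        static boolean writeHeartbeat(String mountPoint, String hostIp) {
            Path hbFile = Paths.get(mountPoint, "KVMHA", "hb-" + hostIp);
            try {
                Files.createDirectories(hbFile.getParent());
                // Record the current time; a healthy host refreshes this regularly.
                Files.write(hbFile, String.valueOf(System.currentTimeMillis()).getBytes());
                return true;
            } catch (IOException e) {
                return false; // NFS export unreachable, read-only, or bad permissions
            }
        }

        public static void main(String[] args) {
            System.out.println(writeHeartbeat(
                    "/mnt/5e41f790-8da9-36d2-938e-f3ea767bfadb", "10.226.31.11"));
        }
    }

Any condition that makes that create/write fail (here, wrong permissions on the export) looks identical to a dead host from the monitor's point of view, which is why the reboot fires.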

> [HA] Shared storage failure results in reboot loop; VMs with Local storage brought offline
> ------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-5859
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-5859
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public (Anyone can view this level - this is the default.)
>          Components: KVM
>    Affects Versions: 4.2.0
>         Environment: RHEL/CentOS 6.4 with KVM
>            Reporter: Dave Garbus
>            Priority: Critical
>
> We have a group of 13 KVM servers added to a single cluster within CloudStack. All VMs use local hypervisor storage, with the exception of one that was configured to use NFS-based primary storage with an HA service offering.
> An issue occurred with the SAN responsible for serving the NFS mount (primary storage for the HA VM) and the mount was put into a read-only state. Shortly after, each host in the cluster rebooted and continued to stay in a reboot loop until I put the primary storage into maintenance. These messages were in the agent.log on each of the KVM hosts:
> 2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout, retry: 4
> 2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout; reboot the host
> In essence, a single HA-enabled VM was able to bring down an entire KVM cluster that was hosting a number of VMs with local storage. It would seem that the fencing script needs to be improved to account for cases where both local and shared storage are used.
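
A sketch of the kind of guard the reporter is asking for: before rebooting over a shared-storage heartbeat failure, check whether the host is also running healthy local-storage VMs. All names and inputs here are hypothetical; the real agent would have to query libvirt and its own VM inventory:

    public class FencingGuardSketch {

        static boolean decideReboot(boolean sharedHeartbeatFailed,
                                    int localStorageVmCount,
                                    int sharedStorageVmCount) {
            if (!sharedHeartbeatFailed) {
                return false;                // nothing to fence
            }
            if (localStorageVmCount > 0) {
                // A reboot would also kill healthy local-storage VMs; prefer to
                // stop or flag only the shared-storage VMs and alert the operator.
                return false;
            }
            return sharedStorageVmCount > 0; // the classic HA fencing case
        }

        public static void main(String[] args) {
            // The scenario from this report: many local VMs, one HA VM on NFS.
            System.out.println(decideReboot(true, 12, 1)); // false: do not reboot
        }
    }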



