cloudstack-issues mailing list archives

From "Jean-Francois Nadeau (JIRA)" <>
Subject [jira] [Created] (CLOUDSTACK-10397) Transient NFS access issues should not result in duplicate VMs or KVM hosts resets
Date Wed, 24 Oct 2018 09:46:00 GMT
Jean-Francois Nadeau created CLOUDSTACK-10397:

             Summary: Transient NFS access issues should not result in duplicate VMs or KVM
hosts resets
                 Key: CLOUDSTACK-10397
             Project: CloudStack
          Issue Type: Bug
      Security Level: Public (Anyone can view this level - this is the default.)
          Components: cloudstack-agent, Hypervisor Controller
    Affects Versions:
            Reporter: Jean-Francois Nadeau

Under CentOS 7.x with KVM and NFS as primary storage, we expect to tolerate and recover from a temporary disconnection from primary storage. We simulate this with iptables on the KVM host, using DROP rules in the INPUT and OUTPUT chains for the NFS server's IP.
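The simulation described above can be sketched roughly as follows. The NFS server IP is a placeholder, and the commands are printed as a dry run; remove the `echo` and run as root on the KVM host to actually apply them:

```shell
# Hypothetical NFS server address; substitute your primary storage IP.
NFS_IP="10.0.0.50"

# Drop all traffic to and from the NFS server to simulate an outage.
echo iptables -I INPUT  -s "$NFS_IP" -j DROP
echo iptables -I OUTPUT -d "$NFS_IP" -j DROP

# To end the simulated outage, delete the same rules.
echo iptables -D INPUT  -s "$NFS_IP" -j DROP
echo iptables -D OUTPUT -d "$NFS_IP" -j DROP
```

Inserting the rules at the head of both chains (`-I`) ensures they take effect even if permissive rules exist further down.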


The observation under 4.11.2 is that an NFS disconnection of more than 5 minutes has the following effects:

With VM HA enabled and host HA disabled: the CloudStack agent will often block while refreshing primary storage and go into the Down state from the controller's perspective. The controller will then restart the VMs on other hosts, creating duplicate VMs on the network and possibly corrupting the VM root disks once the transient issue clears.


With VM HA enabled and host HA enabled: the same agent issue can cause it to block, ending in either the Disconnected or the Down state. The Host HA framework will then reset the KVM host after kvm.ha.degraded.max.period. The problem here is that, yes, Host HA does ensure we don't get duplicate VMs, but at scale this would also provoke a lot of KVM host resets (if not all of them).
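One mitigation worth noting: the grace period before Host HA escalates is governed by the kvm.ha.degraded.max.period setting, which can be tuned per cluster. A minimal sketch using CloudMonkey (cmk), assuming an authenticated profile; the cluster UUID and the 600-second value are placeholders, and the command is printed as a dry run (remove the `echo` to execute it):

```shell
# Placeholder cluster UUID; replace with your KVM cluster's id.
CLUSTER_ID="00000000-0000-0000-0000-000000000000"

# Lengthen the degraded grace period (in seconds) before Host HA fences
# the host, giving transient NFS outages more time to clear.
echo cmk update configuration name=kvm.ha.degraded.max.period value=600 clusterid="$CLUSTER_ID"
```

A longer period reduces spurious resets at the cost of slower reaction to genuine host failures.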


On 4.9.3 the CloudStack agent would simply "hang" in there, and the controller would not see the KVM host as down (at least for 60 minutes). When the network issue blocking NFS access was resolved, all KVM hosts and VMs simply resumed working, with no large-scale fencing.

The same resilience is expected on 4.11.x. This is a blocker for an upgrade from 4.9, considering we are more at risk on 4.11 with VM HA enabled, regardless of whether host HA is enabled.

This message was sent by Atlassian JIRA
