cloudstack-users mailing list archives

From ilya <ilya.mailing.li...@gmail.com>
Subject Re: Is com.cloud.hypervisor.kvm.resource.KVMHAChecker used by CloudStack?
Date Tue, 12 Jul 2016 03:15:20 GMT
Rohan

As of now:
Disconnect the primary NFS storage from your KVM host and see what happens.

In a future release:

The HA piece is being rewritten now. The specs were posted by John
Burwell (and, to a smaller extent, by me). If you search the CloudStack
mailing lists via markmail.org for "KVM HA", you can find the thread with
many details.

In summary, we will be changing the behavior to something more precise,
similar to how VMware does it.

Example: hosts A, B and C are part of one cluster that uses common
clustered storage.

Host A hangs, which halts its VMs' ability to write to disk (or crashes the VMs).

The CloudStack management server (MS) will retrieve the list of volumes used
by the VMs on host A and ask a neighbor host, B, to check when the last write
was performed.

If none of those VMs' disks show any activity for a predefined interval
(several intervals), the CloudStack MS will use the IPMI interface to
shoot the node in the head.

This is a very high-level overview; there is a lot more to it, with
many safeguards and tunable parameters.
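A minimal sketch of the fencing decision described above, assuming the neighbor host reports one "last write" timestamp per volume of the suspect host. The names `VolumeActivity`, `should_fence`, `check_interval_s` and `required_intervals` are hypothetical and are not part of the CloudStack spec or code; the real design has more safeguards and tunables than this.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class VolumeActivity:
    """Hypothetical record: one volume of a VM on the suspect host."""
    volume_id: str
    last_write_epoch: float  # last write observed by the neighbor host (seconds)


def should_fence(volumes: Sequence[VolumeActivity], now: float,
                 check_interval_s: float, required_intervals: int) -> bool:
    """Return True only if EVERY volume has been idle for at least
    required_intervals * check_interval_s seconds (assumed rule)."""
    if not volumes:
        # No evidence either way: never shoot a node based on nothing.
        return False
    idle_threshold = check_interval_s * required_intervals
    # A single recently written volume means the host is still doing I/O.
    return all(now - v.last_write_epoch >= idle_threshold for v in volumes)
```

With a 60-second check interval and 3 required intervals, two volumes last written at t=1000 and t=1100 would justify an IPMI power-off at t=1400 (both idle for at least 180 s) but not at t=1250, because the second volume was written only 150 s earlier. Requiring all volumes to be idle errs on the side of not fencing a host that is still partly alive.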

Regards
ilya


On 7/11/16 5:33 PM, Rohan T wrote:
> Hi All,
> 
> Having been smashed by the unexpected behaviour of the KVM Heartbeat / HA
> process, we've been working through the logic of the process, and I now
> believe the intent of the process is summarised by:
> 
> 
> =================
> The heartbeat process consists of 3 parts:
> 
> 1. a shell script that's distributed to each of the hypervisors during the
> CloudStack installation process:
> /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh
> 2. Two Java classes, built into CloudStack:
> com.cloud.hypervisor.kvm.resource.KVMHAMonitor
> com.cloud.hypervisor.kvm.resource.KVMHAChecker
> 
> Behaviour
> 
> Each of the classes periodically calls the kvmheartbeat.sh script with
> different arguments. The script is used to confirm the existence of NFS
> mounts, remount any that are missing, clean up (i.e. kill) VMs in an
> indeterminate state, read and write heartbeats to NFS volumes, and force the
> host hypervisor to reboot (as part of a "shoot the node in the head"
> approach to restoring sanity to the cluster).
> 
> The KVMHAMonitor class writes a timestamp to each of the NFS volumes
> (pools) every minute. If this write times out (4 times), it calls the
> script once more to force an immediate reboot of the host
> (via: echo b > /proc/sysrq-trigger).
> 
> The KVMHAChecker is responsible for triggering the script to read the
> heartbeat value and compare it with the current timestamp. Where ALL NFS
> volumes are determined to be "DEAD" (i.e. the timestamp is older than 60
> seconds),
> 
> ================
> 
> Is my understanding correct?
> 
> The problem is that, when testing this logic in my test lab (currently
> 4.4.4, but there have been no significant updates committed to these files
> since), I've been unable to see any evidence of the KVMHAChecker actually
> executing! I see plenty of evidence of heartbeat writes (and of hypervisor
> reboots triggered when this process times out).
> 
> 
> Thanks,
> Rohan
> 
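The current heartbeat behaviour summarised in the quoted message can be modeled roughly as below. This is a sketch of Rohan's description, not the actual KVMHAMonitor/KVMHAChecker code. It assumes that failures are counted per pool and reset on a successful write, and it injects the write and reboot actions as callbacks so nothing touches `/proc/sysrq-trigger`. `HeartbeatMonitor`, `dead_pools` and `all_pools_dead` are hypothetical names. The constants 4 and 60 come from the quoted summary. What the checker does once all pools are dead is cut off in the original and is deliberately not modeled.

```python
from typing import Callable, Dict, Iterable, Set

MAX_TIMEOUTS = 4   # failed heartbeat writes tolerated before rebooting (per the summary)
DEAD_AFTER_S = 60  # a heartbeat older than this marks a pool "DEAD" (per the summary)


class HeartbeatMonitor:
    """Hypothetical model of the KVMHAMonitor behaviour described above."""

    def __init__(self, write_heartbeat: Callable[[str], bool],
                 reboot_host: Callable[[], None],
                 max_timeouts: int = MAX_TIMEOUTS) -> None:
        self.write_heartbeat = write_heartbeat  # returns False on timeout
        self.reboot_host = reboot_host          # stands in for "echo b > /proc/sysrq-trigger"
        self.max_timeouts = max_timeouts
        self.failures: Dict[str, int] = {}

    def tick(self, pools: Iterable[str]) -> None:
        """Called once per minute: write a heartbeat timestamp to each pool."""
        for pool in pools:
            if self.write_heartbeat(pool):
                self.failures[pool] = 0  # assumed: a success clears the count
                continue
            self.failures[pool] = self.failures.get(pool, 0) + 1
            if self.failures[pool] >= self.max_timeouts:
                self.reboot_host()  # "shoot the node in the head"
                return


def dead_pools(last_heartbeat: Dict[str, float], now: float,
               dead_after_s: float = DEAD_AFTER_S) -> Set[str]:
    """Pools whose last heartbeat is older than dead_after_s seconds."""
    return {pool for pool, ts in last_heartbeat.items() if now - ts > dead_after_s}


def all_pools_dead(last_heartbeat: Dict[str, float], now: float,
                   dead_after_s: float = DEAD_AFTER_S) -> bool:
    """True only if there is at least one pool and every pool is DEAD."""
    return bool(last_heartbeat) and \
        dead_pools(last_heartbeat, now, dead_after_s) == set(last_heartbeat)
```

In this model, four consecutive failed writes to the same pool trigger the reboot callback. A pool last written 130 s ago counts as DEAD while one written 30 s ago does not, so `all_pools_dead` stays False until every pool has gone stale.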
