cloudstack-users mailing list archives

From Rohan T <rohani...@gmail.com>
Subject Is com.cloud.hypervisor.kvm.resource.KVMHAChecker used by CloudStack?
Date Tue, 12 Jul 2016 00:33:19 GMT
Hi All,

Having been smashed by the unexpected behaviour of the KVM Heartbeat / HA
process, we've been working through the logic of the process, and I now
believe its intent can be summarised as follows:


=================
The heartbeat process consists of 3 parts:

1. A shell script that's distributed to each of the hypervisors during the
CloudStack installation process:
/usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh
2. Two Java classes, built into CloudStack:
com.cloud.hypervisor.kvm.resource.KVMHAMonitor
com.cloud.hypervisor.kvm.resource.KVMHAChecker

Behaviour

Each of the classes periodically calls the kvmheartbeat.sh script with
different arguments. The script is used to confirm the existence of NFS
mounts, remount any that are missing, clean up (i.e. kill) VMs in an
indeterminate state, read and write heartbeats to NFS volumes, and force the
host hypervisor to reboot (as part of a "shoot the node in the head"
approach to restoring sanity to the cluster).

The KVMHAMonitor class writes a timestamp to each of the NFS volumes
(pools) each minute. If this process times out (4 times), it then calls the
script once more to force an immediate reboot of the host (via: echo b >
/proc/sysrq-trigger).
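
To make the write-and-retry behaviour described above concrete, here is a
rough shell sketch. It is hypothetical and not taken from kvmheartbeat.sh or
KVMHAMonitor: the function names, the heartbeat file path and the "REBOOT"
verdict are assumptions for illustration, and it only reports the decision
rather than touching the sysrq trigger.

```shell
#!/bin/sh
# Hypothetical sketch of "write a heartbeat, give up after N failures".
# Not the real kvmheartbeat.sh logic; all names here are made up.

# write_heartbeat FILE
# Writes the current epoch time to FILE; returns non-zero if the write fails
# (for example because the NFS mount point is missing).
write_heartbeat() {
    { date +%s > "$1"; } 2>/dev/null
}

# heartbeat_with_retries FILE MAX_ATTEMPTS
# Prints "OK" on the first successful write, or "REBOOT" if every attempt
# fails. A real host would fence itself at that point; this only reports it.
heartbeat_with_retries() {
    file=$1
    max=$2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if write_heartbeat "$file"; then
            echo "OK"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "REBOOT"
    return 1
}

# Demo against a temporary directory standing in for an NFS pool.
demo_dir=$(mktemp -d)
heartbeat_with_retries "$demo_dir/hb" 4                  # directory exists
heartbeat_with_retries "$demo_dir/missing/hb" 4 || true  # directory is absent
rm -rf "$demo_dir"
```

The first demo call succeeds because the directory is writable; the second
exhausts its 4 attempts because the target directory does not exist, which is
the case where the summary above says the host would be rebooted.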

The KVMHAChecker is responsible for triggering the script to read the
heartbeat value and compare with the current timestamp. Where ALL NFS
volumes are determined to be "DEAD" (i.e. the timestamp is older than 60
seconds),

================
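
The checker comparison summarised above could be sketched roughly as follows.
This is an illustration under assumptions, not the actual KVMHAChecker or
kvmheartbeat.sh code: pool_state and host_state are hypothetical names, and
timestamps are passed in as epoch seconds instead of being read from NFS. The
sketch stops at the ALIVE/DEAD verdict and does not model what happens next.

```shell
#!/bin/sh
# Hypothetical sketch of the heartbeat staleness check; names are made up.

# pool_state HEARTBEAT_EPOCH NOW_EPOCH TIMEOUT_SECONDS
# Prints DEAD if the heartbeat is older than the timeout, otherwise ALIVE.
pool_state() {
    age=$(( $2 - $1 ))
    if [ "$age" -gt "$3" ]; then
        echo "DEAD"
    else
        echo "ALIVE"
    fi
}

# host_state NOW_EPOCH TIMEOUT_SECONDS HEARTBEAT_EPOCH...
# Prints DEAD only when every pool's heartbeat is stale; a single fresh
# heartbeat is enough for ALIVE. With no heartbeats given it prints DEAD.
host_state() {
    now=$1
    timeout=$2
    shift 2
    for hb in "$@"; do
        if [ "$(pool_state "$hb" "$now" "$timeout")" = "ALIVE" ]; then
            echo "ALIVE"
            return 0
        fi
    done
    echo "DEAD"
}

# Demo with fixed timestamps and the 60-second threshold mentioned above.
host_state 1000 60 900 990   # ages 100s and 10s: one pool is still fresh
host_state 1000 60 900 930   # ages 100s and 70s: every pool is stale
```

Passing the current time as an argument keeps the comparison deterministic,
which makes it easier to check the "ALL volumes" condition by hand.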

Is my understanding correct?

The problem is, when testing this logic in my test lab (currently 4.4.4,
but there have been no significant updates committed to these files since),
I've been unable to see any evidence of the KVMHAChecker actually
executing! I see plenty of evidence of heartbeat writes (and of hypervisor
reboots triggered when this process times out).


Thanks,
Rohan
