cloudstack-issues mailing list archives

From "Wido den Hollander (JIRA)" <>
Subject [jira] [Commented] (CLOUDSTACK-8643) Helper for KVM High Availability
Date Thu, 16 Jul 2015 14:49:05 GMT


Wido den Hollander commented on CLOUDSTACK-8643:

This issue was created due to the things I noticed in these two issues.

> Helper for KVM High Availability
> --------------------------------
>                 Key: CLOUDSTACK-8643
>                 URL:
>             Project: CloudStack
>          Issue Type: Improvement
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: KVM, Management Server
>         Environment: KVM hypervisors
>            Reporter: Wido den Hollander
>              Labels: fence, high-availability, kvm, libvirt
>             Fix For: Future
> When running KVM with NFS storage all Agents will write a heartbeat to the NFS.
> Should an Agent go down, it will still be writing heartbeats even if libvirt has died.
> Using these heartbeats the Management Server can ask other KVM Agents if the other server is still beating. If not, it can fence it.
> While this works, I've also encountered scenarios where you run without NFS and still want investigators.
> My proposal would be an Agent Helper running NEXT to the Agent itself.
> A simple Python daemon running a basic HTTP server which queries libvirt every X seconds for:
> * Running Instances
> * Storage pools
> It keeps this in memory, so that even when libvirt goes down it knows what the last state was.
> Using the QEMU Monitor sockets we can actually see if the guests we have in memory are still online.
> If they are we simply keep the list.
> Now, if an investigator comes by and wants to know if the host is still up, it can ALSO ask the helper.
> The management server can ask the helper, but the other agents could as well.
> This doesn't work in all cases, e.g. where storage is lost, but an additional helper would be useful to catch scenarios where the Agent itself became unresponsive.
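
The helper described in the issue could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from CloudStack or from the issue: the names `StateCache`, `query_libvirt`, `make_handler` and `serve`, the JSON layout, port 8080 and the 5 second interval are all hypothetical. The libvirt calls assume the libvirt-python bindings are installed on the hypervisor, and the QEMU Monitor socket check is left out.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def query_libvirt(uri="qemu:///system"):
    """Ask libvirt for running instances and storage pools (needs libvirt-python)."""
    import libvirt  # imported lazily so the rest of the sketch runs without it

    conn = libvirt.openReadOnly(uri)
    try:
        domains = conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE)
        pools = conn.listAllStoragePools()
        return {
            "instances": [d.name() for d in domains],
            "pools": [p.name() for p in pools],
        }
    finally:
        conn.close()


class StateCache:
    """Keeps the last state seen, so it survives a libvirt outage."""

    def __init__(self, query):
        self._query = query  # callable returning {"instances": [...], "pools": [...]}
        self._lock = threading.Lock()
        self._state = {"instances": [], "pools": []}
        self._last_update = None
        self._libvirt_ok = False

    def refresh(self):
        try:
            state = self._query()
        except Exception:
            # libvirt is unreachable: keep the previous state, only flag the failure
            with self._lock:
                self._libvirt_ok = False
            return False
        with self._lock:
            self._state = state
            self._last_update = time.time()
            self._libvirt_ok = True
        return True

    def snapshot(self):
        with self._lock:
            return {
                "libvirt_ok": self._libvirt_ok,
                "last_update": self._last_update,
                "instances": list(self._state["instances"]),
                "pools": list(self._state["pools"]),
            }


def make_handler(cache):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Any GET returns the cached state as JSON
            body = json.dumps(cache.snapshot()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    return Handler


def serve(cache, port=8080, interval=5.0):
    """Poll libvirt every `interval` seconds and answer HTTP queries forever."""

    def poll():
        while True:
            cache.refresh()
            time.sleep(interval)

    threading.Thread(target=poll, daemon=True).start()
    HTTPServer(("", port), make_handler(cache)).serve_forever()
```

On a hypervisor this would be started with something like `serve(StateCache(query_libvirt))`. An investigator on the management server or on another agent could then issue a plain HTTP GET and look at `libvirt_ok`, `last_update` and the instance list to decide whether the host is still alive even though its Agent does not respond.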

This message was sent by Atlassian JIRA
