cloudstack-issues mailing list archives

From "Marcus Sorensen (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CLOUDSTACK-8943) KVM HA is broken, let's fix it
Date Thu, 15 Oct 2015 21:17:05 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959632#comment-14959632
] 

Marcus Sorensen edited comment on CLOUDSTACK-8943 at 10/15/15 9:16 PM:
-----------------------------------------------------------------------

"Having to wait for the affected HV which triggered this to come back and declare it's not
running VMs is a bad idea; this HV could require hours or days of maintenance!"


You don't have to wait for the hypervisor to come back. If you have a hypervisor with issues,
remove it from the cluster. This will mark all VMs that were on the hypervisor as "Stopped",
and they'll be started elsewhere. Later, when the broken hypervisor is fixed, if the agent.properties
file has not changed, it will re-add itself to the cluster.



> KVM HA is broken, let's fix it
> ------------------------------
>
>                 Key: CLOUDSTACK-8943
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8943
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>         Environment: Linux distros with KVM/libvirt
>            Reporter: Nux
>
> Currently KVM HA works by monitoring an NFS-based heartbeat file, and it can often fail
> whenever this network share becomes slower, causing the hypervisors to reboot.
> This can be particularly annoying when you have different kinds of primary storage in
> place that are working fine (people running Ceph, etc.).
> Having to wait for the affected HV which triggered this to come back and declare it's
> not running VMs is a bad idea; this HV could require hours or days of maintenance!
> This is embarrassing. How can we fix it? Ideas, suggestions? How are other hypervisors
> doing it?
> Let's discuss, test, implement. :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
