cloudstack-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error
Date Tue, 30 Oct 2018 12:28:00 GMT


ASF GitHub Bot commented on CLOUDSTACK-10310:

somejfn commented on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage issue
   One problem is that while NFS is unavailable, you won't be able to destroy
   the VM: libvirt will just hang.  So if you attempt to destroy the VM and
   start a new one, then when the NFS service comes back online you will end
   up with a duplicate VM.  That's why I would rather just wait for the NFS
   issue to go away than fire VM-HA in that case.
   On Tue, Oct 30, 2018 at 5:32 AM Paul Angus <> wrote:
   > IMHO I'd say if a VM on that storage is marked as HA-enabled, it should be
   > powered off and restarted somewhere else, and if it isn't HA-enabled, we
   > shouldn't do anything with the running VM (as it's for the user of the VM
   > to deal with it).
   > In either case we should probably set the host to 'alert' so that an admin
   > can see it and do something about it.

This is an automated message from the Apache Git Service.

> KVM hosts reboot if there is a short transient storage error
> ------------------------------------------------------------
>                 Key: CLOUDSTACK-10310
>                 URL:
>             Project: CloudStack
>          Issue Type: Improvement
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: KVM
>    Affects Versions: 4.9.0,
>            Reporter: Sean Lair
>            Priority: Major
> If the KVM heartbeat file can't be written to, the host is rebooted, taking down all
VMs running on it.  The code does try 5 times before the reboot, but there is no delay
between the retries, so they are 5 nearly simultaneous attempts, which doesn't help.
Standard SAN storage HA operations or a quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just commenting
out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be to have it sleep between tries so
that they aren't 5 almost simultaneous tries.  Plus, instead of rebooting, the cloudstack-agent
could just be stopped on the host.  This will cause alerts to be issued, and if the
host is disconnected long enough, depending on the HA code in use, VM HA could handle the
host failure.
> The built-in reboot of the host seemed drastic.
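
The proposal in the description (sleep between heartbeat retries, then take a milder action than a reboot) can be sketched roughly as below. This is an illustration only, not the CloudStack code or the code in the PR: the function name, the parameters, the 60-second default delay, and the `on_failure` callback (standing in for "stop the cloudstack-agent") are all assumptions made for the example. Only the count of 5 tries comes from the description.

```python
import time
from typing import Callable


def write_heartbeat_with_retries(
    write_heartbeat: Callable[[], bool],
    on_failure: Callable[[], None],
    max_tries: int = 5,            # the issue mentions 5 tries
    delay_seconds: float = 60.0,   # assumed value, not taken from the issue
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to write the heartbeat, pausing between failed attempts.

    Returns True as soon as one attempt succeeds. If every attempt fails,
    calls on_failure() once (hypothetically: stop the agent rather than
    reboot the host) and returns False.
    """
    for attempt in range(1, max_tries + 1):
        if write_heartbeat():
            return True
        # Wait before the next attempt so a short storage blip can clear.
        # No sleep is needed after the final attempt.
        if attempt < max_tries:
            sleep(delay_seconds)
    on_failure()
    return False
```

With a delay between attempts, a storage outage shorter than roughly `(max_tries - 1) * delay_seconds` no longer exhausts all retries at once, which is the failure mode the description complains about.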

This message was sent by Atlassian JIRA
