cloudstack-issues mailing list archives

From "Sean Lair (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error
Date Wed, 28 Feb 2018 00:31:00 GMT
Sean Lair created CLOUDSTACK-10310:
--------------------------------------

             Summary: KVM hosts reboot if there is a short transient storage error
                 Key: CLOUDSTACK-10310
                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
             Project: CloudStack
          Issue Type: Improvement
      Security Level: Public (Anyone can view this level - this is the default.)
          Components: KVM
    Affects Versions: 4.10.0.0, 4.9.0
            Reporter: Sean Lair


If the KVM heartbeat file can't be written to, the host is rebooted, taking down
all VMs running on it.  The code does try 5 times before the reboot, but there is no
delay between the retries, so they are 5 nearly simultaneous attempts, which doesn't help.
Standard SAN storage HA operations or a quick network blip could cause this reboot to occur.

Some discussions on the dev mailing list revealed that some people are just commenting out
the reboot line in their version of the CloudStack source.

A better option (and a new PR is being issued) would be to have it sleep between tries so
they aren't 5 almost simultaneous tries.  Plus, instead of rebooting, the cloudstack-agent
could just be stopped on the host.  This will cause alerts to be issued and, if the host
is disconnected long enough, depending on the HA code in use, VM HA could handle the host
failure.
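
To illustrate the proposal above, here is a minimal sketch of a retry loop that sleeps
between attempts and, on final failure, calls a handler instead of rebooting.  It is not
the CloudStack code or the contents of the PR: the function name, the write_heartbeat and
on_failure callbacks, and the 5 second delay are all assumptions made for illustration.

```python
import time
from typing import Callable


def write_heartbeat_with_retries(
    write_heartbeat: Callable[[], bool],   # hypothetical: returns True if the write succeeded
    on_failure: Callable[[], None],        # hypothetical: e.g. stop the agent, not reboot
    max_tries: int = 5,
    delay_seconds: float = 5.0,            # assumed value; the issue does not specify one
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the heartbeat write up to max_tries times, pausing between attempts."""
    for attempt in range(1, max_tries + 1):
        if write_heartbeat():
            return True
        # Only wait if another attempt follows, so a transient error can clear.
        if attempt < max_tries:
            sleep(delay_seconds)
    # All attempts failed: hand off to the less drastic failure action.
    on_failure()
    return False
```

With these assumed defaults, a storage outage has to last roughly 20 seconds (four delays
of 5 seconds) before on_failure runs, rather than failing all 5 tries in the same instant.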

The built-in reboot of the host seemed drastic.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
