Mailing-List: contact issues-help@cloudstack.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cloudstack.apache.org
Date: Tue, 16 Jul 2013 18:14:49 +0000 (UTC)
From: "Logan B (JIRA)" <jira@apache.org>
To: cloudstack-issues@incubator.apache.org
Message-ID: <JIRA.12657750.1373903296638.59411.1373998489772@arcas>
In-Reply-To: <JIRA.12657750.1373903296638@arcas>
References: <JIRA.12657750.1373903296638@arcas>
Subject: [jira] [Commented] (CLOUDSTACK-3535) No HA actions are performed
 when a KVM host goes offline
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CLOUDSTACK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13710029#comment-13710029 ] 

Logan B commented on CLOUDSTACK-3535:
-------------------------------------

Please note that this bug does not only affect KVM.  We have experienced the same issue with XCP 1.6/XenServer hosts.

The problem stems from a previous fix to prevent a potential split brain issue when the management server loses connectivity to the cluster.  The AgentImpl function used to mark the host as down when it couldn't be reached; now it just marks it as "unable to determine state" and does nothing.  This does fix the split brain issue, but if a host actually goes down then HA will never take over.

I realize this is a tricky fix, and my programming knowledge is minimal, but I do have a suggestion for a fix.  The only time the management server should run into an actual split brain issue is if it loses connectivity to the clusters.  Could the following logic be implemented?

(I apologize for the potentially confusing formatting.)

If: Management server cannot ping host:
-> Then: Try to ping management gateway.
--> If: Management server CAN ping gateway:
---> Then: Try to ping other hosts in cluster:
----> If: Other hosts can be pinged AND gateway can be pinged:
-----> Then: Start HA and send host down report/alert.
----> Else If: Other hosts CANNOT be pinged AND gateway CAN be pinged:
-----> Then: Send cluster connectivity alert, and do nothing with HA.
--> Else If: Management server CANNOT ping gateway:
---> Then: Attempt to send management connectivity alert, and do nothing with HA.
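
The decision tree above could be sketched roughly as follows.  This is only an illustration in Python, not CloudStack code: the names decide_action and Action and the ping callback are all hypothetical, and the sketch assumes "other hosts can be pinged" means at least one other host in the cluster responds.

```python
from enum import Enum
from typing import Callable, Iterable


class Action(Enum):
    """Possible outcomes of the check (hypothetical names, not CloudStack's)."""
    NONE = "host reachable, nothing to do"
    START_HA = "start HA and send host down report/alert"
    CLUSTER_ALERT = "send cluster connectivity alert, do nothing with HA"
    MANAGEMENT_ALERT = "attempt management connectivity alert, do nothing with HA"


def decide_action(
    host: str,
    gateway: str,
    other_hosts: Iterable[str],
    ping: Callable[[str], bool],
) -> Action:
    """Decide what to do about `host`, given a caller-supplied ping function."""
    if ping(host):
        return Action.NONE

    # Host is unreachable. If the gateway is also unreachable, the
    # management server itself is probably cut off, so HA stays off.
    if not ping(gateway):
        return Action.MANAGEMENT_ALERT

    # Gateway is reachable. Assumption: one responding peer is enough to
    # conclude that the cluster network is fine and only this host is down.
    if any(ping(peer) for peer in other_hosts):
        return Action.START_HA

    # Gateway answers but no peer does (or there are no peers to ask):
    # treat it as a cluster connectivity problem and leave HA alone.
    return Action.CLUSTER_ALERT
```

Passing ping in as a callback keeps the decision separate from how reachability is actually tested, so the branches can be exercised without touching a network.  Note that a cluster with only one host always ends up in the "cluster connectivity alert" branch here, which is the conservative choice.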

The only time I could see this causing an issue is if the networking for Host A goes down, HA migrates VMs to Host B, then Host A's networking comes back up with running VMs.  I don't see this being a very likely scenario though.

A short term solution would be to at least trigger some sort of alert/e-mail when the host status cannot be determined.  That way manual intervention can be started much more quickly.  Right now a host can be offline indefinitely without any notice.
                
> No HA actions are performed when a KVM host goes offline
> --------------------------------------------------------
>
>                 Key: CLOUDSTACK-3535
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3535
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Hypervisor Controller, KVM, Management Server
>    Affects Versions: 4.1.0, Future
>         Environment: KVM (CentOS 6.3) with CloudStack 4.1
>            Reporter: Paul Angus
>
> If a KVM host 'goes down', CloudStack does not perform HA for instances which are marked as HA enabled on that host (including system VMs)
> CloudStack does not show the host as disconnected.
