From: "Logan B (JIRA)"
To: cloudstack-issues@incubator.apache.org
Reply-To: dev@cloudstack.apache.org
Date: Tue, 16 Jul 2013 18:14:49 +0000 (UTC)
Subject: [jira] [Commented] (CLOUDSTACK-3535) No HA actions are performed when a KVM host goes offline

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13710029#comment-13710029 ]

Logan B commented on CLOUDSTACK-3535:
-------------------------------------

Please note that this bug does not only affect KVM. We have experienced the same issue with XCP 1.6/XenServer hosts.

The problem stems from a previous fix to prevent a potential split brain issue when the management server loses connectivity to the cluster.
The AgentImpl function used to mark the host as down when it couldn't be reached; now it just marks it as "unable to determine state" and does nothing. This does fix the split brain issue, but if a host actually goes down then HA will never take over.

I realize this is a tricky fix, and my programming knowledge is minimal, but I do have a suggestion for a fix. The only time the management server should run into an actual split brain issue is if it loses connectivity to the clusters. Could the following logic be implemented? (I apologize for the potentially confusing formatting.)

If: Management server cannot ping host:
-> Then: Try to ping management gateway.
--> If: Management server CAN ping gateway:
---> Then: Try to ping other hosts in cluster:
----> If: Other hosts can be pinged AND gateway can be pinged:
-----> Then: Start HA and send host down report/alert.
----> Else If: Other hosts CANNOT be pinged AND gateway CAN be pinged:
-----> Then: Send cluster connectivity alert, and do nothing with HA.
--> Else If: Management server CANNOT ping gateway:
---> Then: Attempt to send management connectivity alert, and do nothing with HA.

The only time I could see this causing an issue is if the networking for Host A goes down, HA migrates VMs to Host B, and then Host A's networking comes back up with its VMs still running. I don't see this being a very likely scenario though.

A short term solution would be to at least trigger some sort of alert/e-mail when the host status cannot be determined. That way manual intervention can be started much more quickly. Right now a host can be offline indefinitely without any notice.

> No HA actions are performed when a KVM host goes offline
> --------------------------------------------------------
>
> Key: CLOUDSTACK-3535
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3535
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the default.)
> Components: Hypervisor Controller, KVM, Management Server
> Affects Versions: 4.1.0, Future
> Environment: KVM (CentOS 6.3) with CloudStack 4.1
> Reporter: Paul Angus
>
> If a KVM host 'goes down', CloudStack does not perform HA for instances which are marked as HA enabled on that host (including system VMs)
> CloudStack does not show the host as disconnected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
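
For illustration only, the ping-based decision logic proposed in the comment above could be sketched roughly as follows. This is not CloudStack code: the names `Action` and `decide` are hypothetical, the ping results are passed in as plain booleans rather than measured, and "other hosts can be pinged" is assumed here to mean that at least one other host in the cluster responds (the comment does not say whether one or all are required).

```python
from enum import Enum
from typing import Iterable


class Action(Enum):
    """Hypothetical outcomes, named after the steps in the comment."""
    NONE = "host reachable, nothing to do"
    START_HA = "start HA and send host down report/alert"
    CLUSTER_ALERT = "send cluster connectivity alert, no HA"
    MANAGEMENT_ALERT = "attempt management connectivity alert, no HA"


def decide(host_ok: bool, gateway_ok: bool,
           other_hosts_ok: Iterable[bool]) -> Action:
    """Map ping results to an action, following the proposed tree.

    host_ok        -- management server can ping the suspect host
    gateway_ok     -- management server can ping its gateway
    other_hosts_ok -- one ping result per other host in the cluster
    """
    if host_ok:
        # The tree only applies when the host cannot be pinged.
        return Action.NONE
    if not gateway_ok:
        # The management server itself appears cut off: never start HA.
        return Action.MANAGEMENT_ALERT
    if any(other_hosts_ok):
        # Assumption: one responding peer is enough to rule out a
        # cluster-wide outage, so only the suspect host is down.
        return Action.START_HA
    # Gateway answers but no cluster host does: connectivity problem.
    return Action.CLUSTER_ALERT
```

For example, `decide(False, True, [True, False])` yields `Action.START_HA`, while `decide(False, True, [])` yields `Action.CLUSTER_ALERT`, because with no reachable peer there is no evidence that the cluster network is healthy. The Host A/Host B scenario mentioned in the comment is not handled by this sketch.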