Date: Mon, 8 Sep 2014 14:05:29 +0000 (UTC)
From: "Daan Hoogland (JIRA)"
To: cloudstack-issues@incubator.apache.org
Reply-To: dev@cloudstack.apache.org
Mailing-List: contact issues-help@cloudstack.apache.org; run by ezmlm
Subject: [jira] [Issue Comment Deleted] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

     [ https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daan Hoogland updated CLOUDSTACK-7184:
--------------------------------------
    Comment: was deleted

(was: Hi,
I am currently out of office and will be back Wednesday the 27th of August. During this time I will have limited access to e-mail and might not be able to take your call.
For urgent matter regarding ASR please contact int-asr@schubergphilis.com instead. For other urgent matter please contact one of my colleagues.

Kind regards,
Joris van Lieshout

Schuberg Philis
schubergphilis.com
+31207506672
+31651428188
)

> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-7184
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.)
>          Components: Hypervisor Controller, Management Server, XenServer
>    Affects Versions: 4.3.0, 4.4.0, 4.5.0
>         Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>            Reporter: Remi Bergsma
>            Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did discover this and marked the host as down, and immediately started HA. Just 18 seconds later the hypervisor returned and we ended up with 5 vm's that were running on two hypervisors at the same time.
> This, of course, resulted in file system corruption and the loss of the vm's. One side of the story is why XenServer allowed this to happen (will not bother you with this one). The CloudStack side of the story: HA should only start after at least xen.heartbeat.interval seconds. If the host is down long enough, the Xen heartbeat script will fence the hypervisor and prevent corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN [c.c.a.m.DirectAgentAttache] (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .....
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX. Starting HA on the VMs
> .....
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25 05:03:31,920
> cs marks host up: 2014-07-25 05:03:49,655



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
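
The timing argument in the report can be checked with a short calculation. The sketch below is illustrative Python, not CloudStack code: it takes the two "cs marks host" timestamps from the log and applies the proposed rule that HA may start only once the host has been down for at least xen.heartbeat.interval seconds. The function names are hypothetical, and the 60-second interval is a placeholder, since the report does not state the configured value.

```python
from datetime import datetime

# Timestamp layout used in the log excerpt: "2014-07-25 05:03:31,920"
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two log timestamps."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    ended = datetime.strptime(end, LOG_TIME_FORMAT)
    return (ended - started).total_seconds()


def should_start_ha(seconds_host_down: float, heartbeat_interval: float) -> bool:
    """Hypothetical guard: allow HA only after the heartbeat interval has elapsed."""
    return seconds_host_down >= heartbeat_interval


# Placeholder value for xen.heartbeat.interval; not taken from the report.
ASSUMED_HEARTBEAT_INTERVAL = 60.0

marked_down = "2014-07-25 05:03:31,920"  # cs marks host down
marked_up = "2014-07-25 05:03:49,655"    # cs marks host up

gap = seconds_between(marked_down, marked_up)
print(f"host was marked down for {gap:.3f} s")
print("start HA:", should_start_ha(gap, ASSUMED_HEARTBEAT_INTERVAL))
```

Under this assumed interval the guard would have held HA back, because the host came back after roughly 18 seconds, as the report says.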