Date: Fri, 6 Jun 2014 15:38:01 +0000 (UTC)
From: "c-hemp (JIRA)"
To: cloudstack-issues@incubator.apache.org
Reply-To: dev@cloudstack.apache.org
Subject: [jira] [Updated] (CLOUDSTACK-6857) Losing the connection from CloudStack Manager to the agent will force a shutdown when connection is re-established

     [ https://issues.apache.org/jira/browse/CLOUDSTACK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

c-hemp updated CLOUDSTACK-6857:
-------------------------------

    Description:

If a physical host is not pingable, that host goes into alert mode. When the physical host is unreachable, the virtual router is either unreachable or unable to ping a virtual instance on that host, and the manager cannot ping the virtual instance either, so it assumes the virtual instance is down and puts it into a stopped state.

When the connection is re-established, the manager gets the state from the database, sees that the instance is now stopped, and then shuts the instance down.

This behavior can cause major outages after any kind of network loss, because the damage is done once connectivity comes back. It is especially critical when running CloudStack across multiple colos.

The logs when it happens:

14-06-06 02:01:22,259 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) PingInvestigator found VM[User|cephvmstage013]to be alive? null
2014-06-06 02:01:22,259 DEBUG [c.c.h.ManagementIPSystemVMInvestigator] (HA-Worker-1:ctx-be848615 work-1953) Not a System Vm, unable to determine state of VM[User|cephvmstage013] returning null
2014-06-06 02:01:22,259 DEBUG [c.c.h.ManagementIPSystemVMInvestigator] (HA-Worker-1:ctx-be848615 work-1953) Testing if VM[User|cephvmstage013] is alive
2014-06-06 02:01:22,260 DEBUG [c.c.h.ManagementIPSystemVMInvestigator] (HA-Worker-1:ctx-be848615 work-1953) Unable to find a management nic, cannot ping this system VM, unable to determine state of VM[User|cephvmstage013] returning null
2014-06-06 02:01:22,260 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) ManagementIPSysVMInvestigator found VM[User|cephvmstage013]to be alive? null
2014-06-06 02:01:22,263 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) KVMInvestigator found VM[User|cephvmstage013]to be alive? null
2014-06-06 02:01:22,263 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) HypervInvestigator found VM[User|cephvmstage013]to be alive? null
2014-06-06 02:01:22,419 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) KVMInvestigator found VM[User|cephvmstage013]to be alive? null
2014-06-06 02:01:22,419 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) HypervInvestigator found VM[User|cephvmstage013]to be alive? null
2014-06-06 02:01:22,584 WARN [c.c.v.VirtualMachineManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) Unable to actually stop VM[User|cephvmstage013] but continue with release because it's a force stop
2014-06-06 02:01:22,585 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) VM[User|cephvmstage013] is stopped on the host. Proceeding to release resource held.
2014-06-06 02:01:22,648 WARN [c.c.v.VirtualMachineManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) Unable to actually stop VM[User|cephvmstage013] but continue with release because it's a force stop
2014-06-06 02:01:22,650 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) VM[User|cephvmstage013] is stopped on the host. Proceeding to release resource held.
2014-06-06 02:01:22,704 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) Successfully released network resources for the vm VM[User|cephvmstage013]
2014-06-06 02:01:22,704 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) Successfully released storage resources for the vm VM[User|cephvmstage013]
2014-06-06 02:01:22,774 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) Successfully released network resources for the vm VM[User|cephvmstage013]
2014-06-06 02:01:22,774 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) Successfully released storage resources for the vm VM[User|cephvmstage013]

The behavior should change: the instance should be put into an alert state instead; then, once connectivity is re-established, if the instance is up, the manager should be updated with its running status.
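To make the proposed fix concrete, here is a minimal Java sketch of the idea. It is illustrative only: VmRecord, onInvestigationResult and reconcileOnReconnect are hypothetical names, not CloudStack's actual HighAvailabilityManagerImpl or VirtualMachineManagerImpl API. The point is to treat a null investigator result as "state unknown", flag the VM as alert rather than force-stopping it in the database, and reconcile with the hypervisor's reported power state once the agent reconnects.

    // Hypothetical sketch of the proposed handling; not CloudStack code.
    enum VmState { RUNNING, STOPPED, ALERT }

    class VmRecord {
        final String name;
        VmState state;

        VmRecord(String name, VmState state) {
            this.name = name;
            this.state = state;
        }
    }

    class HaReconnectSketch {

        // When every investigator returns null, liveness is unknown: flag the VM
        // as ALERT and keep its last known state, instead of force-stopping it.
        void onInvestigationResult(VmRecord vm, Boolean alive) {
            if (alive == null) {
                vm.state = VmState.ALERT;     // unknown -> alert only
            } else if (!alive) {
                vm.state = VmState.STOPPED;   // genuinely confirmed down
            }
        }

        // When the agent reconnects, trust the hypervisor's reported power state
        // for VMs that were only flagged ALERT, rather than enforcing the stale
        // database state and shutting the instance down.
        void reconcileOnReconnect(VmRecord vm, boolean runningOnHypervisor) {
            if (vm.state == VmState.ALERT) {
                vm.state = runningOnHypervisor ? VmState.RUNNING : VmState.STOPPED;
            }
        }
    }

With this approach, a VM whose host was merely unreachable would come back as Running after the reconnect sync, instead of being shut down to match a stale Stopped record.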
was:

If a physical host is not pingable, that host goes into alert mode. When the physical host is unreachable, the virtual router is either unreachable or unable to ping a virtual instance on that host, and the manager cannot ping the virtual instance either, so it assumes the host is down and puts it into a stopped state.

When the connection is re-established, the manager gets the state from the database, sees that the instance is now stopped, and then shuts the instance down.

This behavior can cause major outages after any kind of network loss, because the damage is done once connectivity comes back. It is especially critical when running CloudStack across multiple colos.

The behavior should change: the instance should be put into an alert state instead; then, once connectivity is re-established, if the instance is up, the manager should be updated with its running status.


> Losing the connection from CloudStack Manager to the agent will force a shutdown when connection is re-established
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-6857
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6857
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public (Anyone can view this level - this is the default.)
>          Components: Management Server
>    Affects Versions: 4.3.0
>         Environment: Ubuntu 12.04
>            Reporter: c-hemp
>            Priority: Critical
>
> (Description and log excerpt as above.)
--
This message was sent by Atlassian JIRA
(v6.2#6252)