Subject: CloudStack agent shuts down VMs upon reconnecting to Management server
From: Indra Pramana
To: users@cloudstack.apache.org
Date: Mon, 2 May 2016 01:53:29 +0800

Dear all,

We are using CloudStack 4.2.0 with the KVM hypervisor and Ceph RBD storage. We have been having a specific problem for quite some time (possibly since the first day we used CloudStack), which we suspect is related to HA.

When a CloudStack agent gets disconnected from the management server for any reason, CloudStack gradually marks some or all of the VMs on the disconnected host as "Stopped", even though they are actually still running on that host. When I try to reconnect the agent, CloudStack seems to instruct the agent to stop those VMs first: while the host is in the "Connecting" state, the agent is busy shutting the VMs down one by one, before the host can return to the "Up" state. This takes down all the VMs on the host with the disconnected agent unnecessarily, even though technically they could stay up while the agent reconnects to the management server.

Is there a way we can prevent CloudStack from shutting down the VMs during agent re-connection? Relevant logs from the management server and the agent are below; it looks like HA is the culprit. Any advice is appreciated.
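To show that the "Stopped" state is wrong: while CloudStack reports such a VM as stopped, it is still visible as running on the hypervisor itself. A quick check on the KVM host, using the internal instance name (i-1082-3086-VM in the agent logs further below), would be along these lines:

=====
# Run on the KVM host while the agent is disconnected; the instance
# should still be listed as running even though CloudStack says "Stopped".
virsh list --all | grep i-1082-3086-VM
=====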
Excerpts from the management server logs -- in the example below, the hostname of the affected VM on the disconnected host is "vm-hostname", and the lines are the result of grepping for "vm-hostname" in the logs.
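(For reference, the excerpts were pulled with a plain grep against the management server log; the path below assumes the default packaged install location, so adjust if yours differs:)

=====
# Default management-server log location on a packaged install (adjust as needed)
grep 'vm-hostname' /var/log/cloudstack/management/management-server.log
=====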
=====
2016-04-30 23:24:32,680 INFO [cloud.ha.HighAvailabilityManagerImpl] (Timer-1:null) Schedule vm for HA: VM[User|vm-hostname]
2016-04-30 23:24:35,565 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) HA on VM[User|vm-hostname]
2016-04-30 23:24:35,571 DEBUG [cloud.ha.CheckOnAgentInvestigator] (HA-Worker-1:work-11007) Unable to reach the agent for VM[User|vm-hostname]: Resource [Host:34] is unreachable: Host 34: Host with specified id is not in the right state: Disconnected
2016-04-30 23:24:35,571 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) SimpleInvestigator found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,571 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) XenServerInvestigator found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,571 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-1:work-11007) testing if VM[User|vm-hostname] is alive
2016-04-30 23:24:35,581 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-1:work-11007) VM[User|vm-hostname] could not be pinged, returning that it is unknown
2016-04-30 23:24:35,581 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-1:work-11007) Returning null since we're unable to determine state of VM[User|vm-hostname]
2016-04-30 23:24:35,581 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) null found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,582 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-1:work-11007) Not a System Vm, unable to determine state of VM[User|vm-hostname] returning null
2016-04-30 23:24:35,582 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-1:work-11007) Testing if VM[User|vm-hostname] is alive
2016-04-30 23:24:35,586 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-1:work-11007) Unable to find a management nic, cannot ping this system VM, unable to determine state of VM[User|vm-hostname] returning null
2016-04-30 23:24:35,586 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) null found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,588 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) KVMInvestigator found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,592 DEBUG [cloud.ha.KVMFencer] (HA-Worker-1:work-11007) Unable to fence off VM[User|vm-hostname] on Host[-34-Routing]
2016-04-30 23:24:35,592 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) We were unable to fence off the VM VM[User|vm-hostname]
2016-04-30 23:24:35,592 WARN [apache.cloudstack.alerts] (HA-Worker-1:work-11007) alertType:: 8 // dataCenterId:: 6 // podId:: 6 // clusterId:: null // message:: Unable to restart vm-hostname which was running on host name: hypervisor-host(id:34), availability zone: xxxxxxxxxx-Singapore-01, pod: xxxxxxxxxx-Singapore-Pod-01
2016-04-30 23:24:41,028 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentConnectTaskPool-4:null) Both states are Running for VM[User|vm-hostname]
=====

The above keeps looping until the management server eventually decides to do a force stop, as follows:

=====
2016-05-01 00:30:23,305 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) HA on VM[User|vm-hostname]
2016-05-01 00:30:23,311 DEBUG [cloud.ha.CheckOnAgentInvestigator] (HA-Worker-3:work-11249) Unable to reach the agent for VM[User|vm-hostname]: Resource [Host:34] is unreachable: Host 34: Host with specified id is not in the right state: Disconnected
2016-05-01 00:30:23,311 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) SimpleInvestigator found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:23,311 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) XenServerInvestigator found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:23,311 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-3:work-11249) testing if VM[User|vm-hostname] is alive
2016-05-01 00:30:35,499 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-3:work-11249) VM[User|vm-hostname] could not be pinged, returning that it is unknown
2016-05-01 00:30:35,499 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-3:work-11249) Returning null since we're unable to determine state of VM[User|vm-hostname]
2016-05-01 00:30:35,499 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) null found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:35,499 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-3:work-11249) Not a System Vm, unable to determine state of VM[User|vm-hostname] returning null
2016-05-01 00:30:35,499 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-3:work-11249) Testing if VM[User|vm-hostname] is alive
2016-05-01 00:30:35,505 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-3:work-11249) Unable to find a management nic, cannot ping this system VM, unable to determine state of VM[User|vm-hostname] returning null
2016-05-01 00:30:35,505 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) null found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:35,558 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) KVMInvestigator found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:35,688 WARN [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) Unable to actually stop VM[User|vm-hostname] but continue with release because it's a force stop
2016-05-01 00:30:35,693 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) VM[User|vm-hostname] is stopped on the host. Proceeding to release resource held.
2016-05-01 00:30:35,698 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) Successfully released network resources for the vm VM[User|vm-hostname]
2016-05-01 00:30:35,698 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) Successfully released storage resources for the vm VM[User|vm-hostname]
2016-05-01 00:31:38,426 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11183) HA on VM[User|vm-hostname]
2016-05-01 00:31:38,426 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11183) VM VM[User|vm-hostname] has been changed. Current State = Stopped Previous State = Running last updated = 113 previous updated = 111
=====

Below are excerpts from the corresponding agent.log; note that i-1082-3086-VM is the instance name for the example vm-hostname above:

=====
2016-04-30 23:24:36,592 DEBUG [kvm.resource.LibvirtComputingResource] (Agent-Handler-1:null) Detecting a new state but couldn't find a old state so adding it to the changes: i-1082-3086-VM
=====

After the management server marks the VM as stopped, the agent tries to shut the VM down as soon as it reconnects to the management server:

=====
2016-05-01 00:32:32,029 DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:null) Processing command: com.cloud.agent.api.StopCommand
2016-05-01 00:32:32,063 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py destroy_network_rules_for_vm --vmname i-1082-3086-VM --vif vnet11
2016-05-01 00:32:32,195 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Execution is successful.
2016-05-01 00:32:32,196 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Try to stop the vm at first
=====

and

=====
2016-05-01 00:33:04,835 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) successfully shut down vm i-1082-3086-VM
2016-05-01 00:33:04,836 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) Executing: /bin/bash -c ls /sys/class/net/breth1-8/brif | grep vnet
2016-05-01 00:33:04,847 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) Execution is successful.
=====
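If it helps with diagnosis: while this loop is running, the pending HA jobs should be visible in the management server database. Something like the query below should show them -- note that the table and column names here are our assumption from the schema, so treat it as a sketch rather than a verified query:

=====
# Run against the 'cloud' database on the management server
# (user/db name 'cloud' is the CloudStack default; adjust to your setup).
# op_ha_work is, we believe, the queue the HA workers pull jobs from.
mysql -u cloud -p cloud -e \
  "SELECT id, instance_id, type, step, taken FROM op_ha_work WHERE step != 'Done';"
=====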
Our questions:

- Is there a way for us to prevent the above scenario from happening?
- Is disabling HA on each VM the only way to prevent it?
- We understand that disabling HA requires applying a new service offering to each VM and restarting the VM for the change to take effect (our understanding of that procedure is sketched in the P.S. below). Is there a way to disable HA globally, without changing the service offering on every VM?
- Is it possible to avoid the above scenario without disabling HA and losing the HA features and functionality altogether?

Any advice is greatly appreciated. Looking forward to your reply, thank you.

Cheers.
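P.S. For context on the third question: our understanding is that moving a VM off an HA-enabled offering means stopping the VM, switching it to a non-HA offering, and starting it again, repeated for every VM. With cloudmonkey that would look roughly like the sketch below (the UUIDs are placeholders, and we have not verified this exact syntax):

=====
# Sketch only: stop the VM, switch it to an offering without HA, start it again.
# <vm-uuid> and <non-ha-offering-uuid> are placeholders.
cloudmonkey stop virtualmachine id=<vm-uuid>
cloudmonkey change serviceforvirtualmachine id=<vm-uuid> serviceofferingid=<non-ha-offering-uuid>
cloudmonkey start virtualmachine id=<vm-uuid>
=====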