Subject: CloudStack agent shuts down VMs upon reconnecting to Management server
From: Indra Pramana
To: users@cloudstack.apache.org
Date: Mon, 2 May 2016 01:53:29 +0800

Dear all,

We are using CloudStack 4.2.0 with the KVM hypervisor and Ceph RBD storage. We have been having a specific problem for quite some time (possibly since the first day we used CloudStack), which we suspect is related to HA.

When a CloudStack agent gets disconnected from the management server for any reason, CloudStack gradually marks some or all of the VMs on the disconnected host as "Stopped", even though they are actually still running on that host. When I try to reconnect the agent, CloudStack seems to instruct the agent to stop those VMs first: while the host is in the "Connecting" state, the agent is busy shutting the VMs down one by one, before the host can return to the "Up" state. This takes down all the VMs on the host with the disconnected agent unnecessarily, even though technically they could stay up while the agent reconnects to the management server.

Is there a way we can prevent CloudStack from shutting down the VMs during agent re-connection? Relevant logs from the management server and the agent are below; it looks like HA is the culprit. Any advice is appreciated.
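To show that the "Stopped" state is wrong: while CloudStack reports such a VM as stopped, it is still visible as running on the hypervisor itself. A quick check on the KVM host, using the internal instance name (i-1082-3086-VM in the agent logs further below), would be along these lines:

=====
# Run on the KVM host while the agent is disconnected; the instance
# should still be listed as running even though CloudStack says "Stopped".
virsh list --all | grep i-1082-3086-VM
=====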
Excerpts from the management server logs -- in the example below, the hostname of the affected VM on the disconnected host is "vm-hostname", and the lines are the result of grepping for "vm-hostname" in the logs.
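(For reference, the excerpts were pulled with a plain grep against the management server log; the path below assumes the default packaged install location, so adjust if yours differs:)

=====
# Default management-server log location on a packaged install (adjust as needed)
grep 'vm-hostname' /var/log/cloudstack/management/management-server.log
=====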
=====
2016-04-30 23:24:32,680 INFO [cloud.ha.HighAvailabilityManagerImpl] (Timer-1:null) Schedule vm for HA: VM[User|vm-hostname]
2016-04-30 23:24:35,565 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) HA on VM[User|vm-hostname]
2016-04-30 23:24:35,571 DEBUG [cloud.ha.CheckOnAgentInvestigator] (HA-Worker-1:work-11007) Unable to reach the agent for VM[User|vm-hostname]: Resource [Host:34] is unreachable: Host 34: Host with specified id is not in the right state: Disconnected
2016-04-30 23:24:35,571 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) SimpleInvestigator found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,571 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) XenServerInvestigator found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,571 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-1:work-11007) testing if VM[User|vm-hostname] is alive
2016-04-30 23:24:35,581 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-1:work-11007) VM[User|vm-hostname] could not be pinged, returning that it is unknown
2016-04-30 23:24:35,581 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-1:work-11007) Returning null since we're unable to determine state of VM[User|vm-hostname]
2016-04-30 23:24:35,581 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) null found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,582 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-1:work-11007) Not a System Vm, unable to determine state of VM[User|vm-hostname] returning null
2016-04-30 23:24:35,582 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-1:work-11007) Testing if VM[User|vm-hostname] is alive
2016-04-30 23:24:35,586 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-1:work-11007) Unable to find a management nic, cannot ping this system VM, unable to determine state of VM[User|vm-hostname] returning null
2016-04-30 23:24:35,586 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) null found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,588 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) KVMInvestigator found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,592 DEBUG [cloud.ha.KVMFencer] (HA-Worker-1:work-11007) Unable to fence off VM[User|vm-hostname] on Host[-34-Routing]
2016-04-30 23:24:35,592 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) We were unable to fence off the VM VM[User|vm-hostname]
2016-04-30 23:24:35,592 WARN [apache.cloudstack.alerts] (HA-Worker-1:work-11007) alertType:: 8 // dataCenterId:: 6 // podId:: 6 // clusterId:: null // message:: Unable to restart vm-hostname which was running on host name: hypervisor-host(id:34), availability zone: xxxxxxxxxx-Singapore-01, pod: xxxxxxxxxx-Singapore-Pod-01
2016-04-30 23:24:41,028 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentConnectTaskPool-4:null) Both states are Running for VM[User|vm-hostname]
=====

The above keeps looping until the management server eventually decides to do a force stop, as follows:

=====
2016-05-01 00:30:23,305 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) HA on VM[User|vm-hostname]
2016-05-01 00:30:23,311 DEBUG [cloud.ha.CheckOnAgentInvestigator] (HA-Worker-3:work-11249) Unable to reach the agent for VM[User|vm-hostname]: Resource [Host:34] is unreachable: Host 34: Host with specified id is not in the right state: Disconnected
2016-05-01 00:30:23,311 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) SimpleInvestigator found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:23,311 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) XenServerInvestigator found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:23,311 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-3:work-11249) testing if VM[User|vm-hostname] is alive
2016-05-01 00:30:35,499 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-3:work-11249) VM[User|vm-hostname] could not be pinged, returning that it is unknown
2016-05-01 00:30:35,499 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-3:work-11249) Returning null since we're unable to determine state of VM[User|vm-hostname]
2016-05-01 00:30:35,499 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) null found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:35,499 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-3:work-11249) Not a System Vm, unable to determine state of VM[User|vm-hostname] returning null
2016-05-01 00:30:35,499 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-3:work-11249) Testing if VM[User|vm-hostname] is alive
2016-05-01 00:30:35,505 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-3:work-11249) Unable to find a management nic, cannot ping this system VM, unable to determine state of VM[User|vm-hostname] returning null
2016-05-01 00:30:35,505 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) null found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:35,558 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) KVMInvestigator found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:35,688 WARN [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) Unable to actually stop VM[User|vm-hostname] but continue with release because it's a force stop
2016-05-01 00:30:35,693 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) VM[User|vm-hostname] is stopped on the host. Proceeding to release resource held.
2016-05-01 00:30:35,698 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) Successfully released network resources for the vm VM[User|vm-hostname]
2016-05-01 00:30:35,698 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) Successfully released storage resources for the vm VM[User|vm-hostname]
2016-05-01 00:31:38,426 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11183) HA on VM[User|vm-hostname]
2016-05-01 00:31:38,426 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11183) VM VM[User|vm-hostname] has been changed. Current State = Stopped Previous State = Running last updated = 113 previous updated = 111
=====

Below are excerpts from the corresponding agent.log; note that i-1082-3086-VM is the instance name for the example vm-hostname above:

=====
2016-04-30 23:24:36,592 DEBUG [kvm.resource.LibvirtComputingResource] (Agent-Handler-1:null) Detecting a new state but couldn't find a old state so adding it to the changes: i-1082-3086-VM
=====

After the management server marks the VM as stopped, the agent tries to shut the VM down as soon as it reconnects to the management server:

=====
2016-05-01 00:32:32,029 DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:null) Processing command: com.cloud.agent.api.StopCommand
2016-05-01 00:32:32,063 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py destroy_network_rules_for_vm --vmname i-1082-3086-VM --vif vnet11
2016-05-01 00:32:32,195 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Execution is successful.
2016-05-01 00:32:32,196 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Try to stop the vm at first
=====

and

=====
2016-05-01 00:33:04,835 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) successfully shut down vm i-1082-3086-VM
2016-05-01 00:33:04,836 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) Executing: /bin/bash -c ls /sys/class/net/breth1-8/brif | grep vnet
2016-05-01 00:33:04,847 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) Execution is successful.
=====
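If it helps with diagnosis: while this loop is running, the pending HA jobs should be visible in the management server database. Something like the query below should show them -- note that the table and column names here are our assumption from the schema, so treat it as a sketch rather than a verified query:

=====
# Run against the 'cloud' database on the management server
# (user/db name 'cloud' is the CloudStack default; adjust to your setup).
# op_ha_work is, we believe, the queue the HA workers pull jobs from.
mysql -u cloud -p cloud -e \
  "SELECT id, instance_id, type, step, taken FROM op_ha_work WHERE step != 'Done';"
=====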
Our questions:

- Is there a way for us to prevent the above scenario from happening?
- Is disabling HA on each VM the only way to prevent it?
- We understand that disabling HA requires applying a new service offering to each VM and restarting the VM for the change to take effect (our understanding of that procedure is sketched in the P.S. below). Is there a way to disable HA globally, without changing the service offering on every VM?
- Is it possible to avoid the above scenario without disabling HA and losing the HA features and functionality altogether?

Any advice is greatly appreciated. Looking forward to your reply, thank you.

Cheers.
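P.S. For context on the third question: our understanding is that moving a VM off an HA-enabled offering means stopping the VM, switching it to a non-HA offering, and starting it again, repeated for every VM. With cloudmonkey that would look roughly like the sketch below (the UUIDs are placeholders, and we have not verified this exact syntax):

=====
# Sketch only: stop the VM, switch it to an offering without HA, start it again.
# <vm-uuid> and <non-ha-offering-uuid> are placeholders.
cloudmonkey stop virtualmachine id=<vm-uuid>
cloudmonkey change serviceforvirtualmachine id=<vm-uuid> serviceofferingid=<non-ha-offering-uuid>
cloudmonkey start virtualmachine id=<vm-uuid>
=====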