Dear all,

I received advice from a nice guy on the IRC channel to increase the HA
timer, which I suppose is the delay before HA workers are started after a
host is disconnected. However, I can't seem to find the setting in
CloudStack's global settings. Does anyone know how to set this up?

I can only find these two settings related to HA in the global settings:

- ha.tag: HA tag defining that the host marked with this tag can be used for HA purposes only
- ha.workers: Number of ha worker threads (current value: 5)

I also noted that the default value of ha.workers is 5, and that I can
actually set it to 0 -- can I prevent the HA workers from being started by
doing this, and will there be any impact on overall CloudStack operations?
I am looking at setting this temporarily until I can find a more proper
solution to the problem.
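In case it helps anyone searching for the same thing later: my
understanding is that a global setting can also be changed from the command
line through the updateConfiguration API. A rough sketch of what I plan to
try, assuming CloudMonkey is installed and pointed at the management server
(I have not verified this yet):

=====
# check the current value of the setting
cloudmonkey list configurations name=ha.workers

# set it to 0; most global settings only take effect after the
# management server is restarted (service name may differ per distro)
cloudmonkey update configuration name=ha.workers value=0
service cloudstack-management restart
=====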
Looking forward to your reply, thank you.

Cheers.

On Mon, May 2, 2016 at 1:53 AM, Indra Pramana wrote:

> Dear all,
>
> We are using CloudStack 4.2.0, the KVM hypervisor and Ceph RBD storage.
> We have been having a specific problem for quite some time (possibly
> since the first day we used CloudStack), which we suspect is related to
> HA.
>
> When a CloudStack agent gets disconnected from the management server for
> any reason, CloudStack gradually marks some or all of the VMs on the
> disconnected host as "Stopped", even though they are actually still
> running on that host. When I try to reconnect the agent, CloudStack
> seems to instruct the agent to stop those VMs first, and it stays busy
> shutting down the VMs one by one while in the "Connecting" state before
> it can reach the "Up" state.
>
> This causes all the VMs on the host (with the disconnected agent) to be
> down unnecessarily, even though technically they could stay up while the
> agent is reconnecting to the management server.
>
> Is there a way we can prevent CloudStack from shutting down the VMs
> during agent re-connection? Relevant logs from the management server and
> agent are below; it seems HA is the culprit.
>
> Any advice is appreciated.
>
> Excerpts from the management server logs -- in the example below, the
> hostname of the affected VM on the disconnected host is "vm-hostname",
> and the lines are the result of grepping "vm-hostname" from the logs.
>
> =====
> 2016-04-30 23:24:32,680 INFO [cloud.ha.HighAvailabilityManagerImpl] (Timer-1:null) Schedule vm for HA: VM[User|vm-hostname]
> 2016-04-30 23:24:35,565 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) HA on VM[User|vm-hostname]
> 2016-04-30 23:24:35,571 DEBUG [cloud.ha.CheckOnAgentInvestigator] (HA-Worker-1:work-11007) Unable to reach the agent for VM[User|vm-hostname]: Resource [Host:34] is unreachable: Host 34: Host with specified id is not in the right state: Disconnected
> 2016-04-30 23:24:35,571 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) SimpleInvestigator found VM[User|vm-hostname]to be alive? null
> 2016-04-30 23:24:35,571 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) XenServerInvestigator found VM[User|vm-hostname]to be alive? null
> 2016-04-30 23:24:35,571 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-1:work-11007) testing if VM[User|vm-hostname] is alive
> 2016-04-30 23:24:35,581 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-1:work-11007) VM[User|vm-hostname] could not be pinged, returning that it is unknown
> 2016-04-30 23:24:35,581 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-1:work-11007) Returning null since we're unable to determine state of VM[User|vm-hostname]
> 2016-04-30 23:24:35,581 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) null found VM[User|vm-hostname]to be alive? null
> 2016-04-30 23:24:35,582 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-1:work-11007) Not a System Vm, unable to determine state of VM[User|vm-hostname] returning null
> 2016-04-30 23:24:35,582 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-1:work-11007) Testing if VM[User|vm-hostname] is alive
> 2016-04-30 23:24:35,586 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-1:work-11007) Unable to find a management nic, cannot ping this system VM, unable to determine state of VM[User|vm-hostname] returning null
> 2016-04-30 23:24:35,586 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) null found VM[User|vm-hostname]to be alive? null
> 2016-04-30 23:24:35,588 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) KVMInvestigator found VM[User|vm-hostname]to be alive? null
> 2016-04-30 23:24:35,592 DEBUG [cloud.ha.KVMFencer] (HA-Worker-1:work-11007) Unable to fence off VM[User|vm-hostname] on Host[-34-Routing]
> 2016-04-30 23:24:35,592 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) We were unable to fence off the VM VM[User|vm-hostname]
> 2016-04-30 23:24:35,592 WARN [apache.cloudstack.alerts] (HA-Worker-1:work-11007) alertType:: 8 // dataCenterId:: 6 // podId:: 6 // clusterId:: null // message:: Unable to restart vm-hostname which was running on host name: hypervisor-host(id:34), availability zone: xxxxxxxxxx-Singapore-01, pod: xxxxxxxxxx-Singapore-Pod-01
> 2016-04-30 23:24:41,028 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentConnectTaskPool-4:null) Both states are Running for VM[User|vm-hostname]
> =====
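> While the investigators keep returning null like the above, the VM
> itself is still running on the host. For reference, this can be verified
> directly with libvirt on the hypervisor, for example (i-1082-3086-VM is
> the instance name of vm-hostname; see the agent.log excerpts further
> below):
>
> =====
> # run on the KVM host whose agent is disconnected
> virsh list --all | grep i-1082-3086-VM
> virsh domstate i-1082-3086-VM
> =====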
> The log excerpt above keeps repeating until the CloudStack management
> server decides to do a force stop, as follows:
>
> =====
> 2016-05-01 00:30:23,305 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) HA on VM[User|vm-hostname]
> 2016-05-01 00:30:23,311 DEBUG [cloud.ha.CheckOnAgentInvestigator] (HA-Worker-3:work-11249) Unable to reach the agent for VM[User|vm-hostname]: Resource [Host:34] is unreachable: Host 34: Host with specified id is not in the right state: Disconnected
> 2016-05-01 00:30:23,311 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) SimpleInvestigator found VM[User|vm-hostname]to be alive? null
> 2016-05-01 00:30:23,311 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) XenServerInvestigator found VM[User|vm-hostname]to be alive? null
> 2016-05-01 00:30:23,311 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-3:work-11249) testing if VM[User|vm-hostname] is alive
> 2016-05-01 00:30:35,499 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-3:work-11249) VM[User|vm-hostname] could not be pinged, returning that it is unknown
> 2016-05-01 00:30:35,499 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-3:work-11249) Returning null since we're unable to determine state of VM[User|vm-hostname]
> 2016-05-01 00:30:35,499 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) null found VM[User|vm-hostname]to be alive? null
> 2016-05-01 00:30:35,499 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-3:work-11249) Not a System Vm, unable to determine state of VM[User|vm-hostname] returning null
> 2016-05-01 00:30:35,499 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-3:work-11249) Testing if VM[User|vm-hostname] is alive
> 2016-05-01 00:30:35,505 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-3:work-11249) Unable to find a management nic, cannot ping this system VM, unable to determine state of VM[User|vm-hostname] returning null
> 2016-05-01 00:30:35,505 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) null found VM[User|vm-hostname]to be alive? null
> 2016-05-01 00:30:35,558 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) KVMInvestigator found VM[User|vm-hostname]to be alive? null
> 2016-05-01 00:30:35,688 WARN [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) Unable to actually stop VM[User|vm-hostname] but continue with release because it's a force stop
> 2016-05-01 00:30:35,693 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) VM[User|vm-hostname] is stopped on the host. Proceeding to release resource held.
> 2016-05-01 00:30:35,698 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) Successfully released network resources for the vm VM[User|vm-hostname]
> 2016-05-01 00:30:35,698 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) Successfully released storage resources for the vm VM[User|vm-hostname]
> 2016-05-01 00:31:38,426 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11183) HA on VM[User|vm-hostname]
> 2016-05-01 00:31:38,426 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11183) VM VM[User|vm-hostname] has been changed. Current State = Stopped Previous State = Running last updated = 113 previous updated = 111
> =====
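> As a side note, the "last updated" numbers in the final log line above
> appear to correspond to the update_count column of the vm_instance
> table, so the state flip can also be watched from the database. A rough
> query, assuming the standard "cloud" schema (for observation only, of
> course -- not for changing anything):
>
> =====
> mysql -u cloud -p cloud -e "SELECT id, instance_name, state, update_count FROM vm_instance WHERE instance_name = 'i-1082-3086-VM';"
> =====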
> Below are excerpts from the corresponding agent.log; note that
> i-1082-3086-VM is the instance name of the vm-hostname example above:
>
> =====
> 2016-04-30 23:24:36,592 DEBUG [kvm.resource.LibvirtComputingResource] (Agent-Handler-1:null) Detecting a new state but couldn't find a old state so adding it to the changes: i-1082-3086-VM
> =====
>
> After the CloudStack management server decides to mark the VM as
> stopped, the agent will try to shut down the VM when it reconnects to
> the management server:
>
> =====
> 2016-05-01 00:32:32,029 DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:null) Processing command: com.cloud.agent.api.StopCommand
> 2016-05-01 00:32:32,063 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py destroy_network_rules_for_vm --vmname i-1082-3086-VM --vif vnet11
> 2016-05-01 00:32:32,195 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Execution is successful.
> 2016-05-01 00:32:32,196 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Try to stop the vm at first
> =====
>
> and
>
> =====
> 2016-05-01 00:33:04,835 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) successfully shut down vm i-1082-3086-VM
> 2016-05-01 00:33:04,836 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) Executing: /bin/bash -c ls /sys/class/net/breth1-8/brif | grep vnet
> 2016-05-01 00:33:04,847 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) Execution is successful.
> =====
>
> - Is there a way for us to prevent the above scenario from happening?
> - Is disabling HA on the VM the only way to prevent it?
> - I understand that disabling HA will require applying a new service
>   offering to each VM and restarting the VM for the change to take
>   effect. Is there a way to disable HA globally without changing the
>   service offering for each VM?
> - Is it possible to avoid the above scenario without having to disable
>   HA and lose the HA features and functionality?
>
> Any advice is greatly appreciated.
>
> Looking forward to your reply, thank you.
>
> Cheers.