cloudstack-users mailing list archives

From: Indra Pramana <indra@sg.or.id>
Subject: Re: CloudStack agent shuts down VMs upon reconnecting to Management server
Date: Mon, 02 May 2016 01:42:44 GMT
Dear all,

I received advice from a nice guy on the IRC channel to increase the HA
timer, which I suppose is the time period after which HA workers are started
upon disconnection of a host. However, I can't seem to find the setting in
CloudStack's global settings. Does anyone know how to set this up? I can only
find these two HA-related settings in the global settings (a quick way to
search the full list is sketched after them):

ha.tag -- HA tag defining that the host marked with this tag can be used for
HA purposes only

ha.workers -- Number of HA worker threads. Default: 5
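
For searching: a rough, untested Python sketch that dumps anything HA- or
ping-related from the global settings. It assumes the unauthenticated
integration API port (global setting integration.api.port, commonly 8096) is
enabled on the management server; with the normal authenticated API the
request would need to be signed. If I recall correctly, the window before a
host is marked disconnected (and HA kicks in) is governed by ping.interval
and ping.timeout, which may be the "HA timer" the IRC advice referred to.

=====
import requests  # assumes the 'requests' library is installed

MGMT = "http://mgmt-server:8096/client/api"  # hypothetical management server address

# Fetch the global settings and print anything that looks HA- or ping-related.
# (Paging is ignored here for brevity; add page/pagesize params for a full dump.)
resp = requests.get(MGMT, params={
    "command": "listConfigurations",
    "response": "json",
}).json()
for cfg in resp["listconfigurationsresponse"]["configuration"]:
    name = cfg.get("name", "")
    if "ha." in name or name.startswith("ping."):
        print(name, "=", cfg.get("value"), "--", cfg.get("description"))
=====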


I also noted that the default value of ha.workers is 5, and that I can
actually set it to 0 -- would that prevent the HA workers from being started,
and would there be any impact on overall CloudStack operations? I'm looking
into setting this temporarily until I can find a more proper solution to the
problem.
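
In case it helps, a minimal sketch of flipping ha.workers through the API --
untested, and again assuming the unauthenticated integration API port is
enabled. I believe the HA worker pool is sized when the management server
starts, so a restart is probably needed for the new value to take effect.

=====
import requests  # assumes the 'requests' library is installed

MGMT = "http://mgmt-server:8096/client/api"  # hypothetical management server address

# Set ha.workers to 0 so no HA worker threads are available to schedule work.
requests.get(MGMT, params={
    "command": "updateConfiguration",
    "name": "ha.workers",
    "value": "0",
    "response": "json",
})
# A management server restart is likely required before this takes effect,
# since the worker thread pool appears to be created at startup.
=====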

Looking forward to your reply, thank you.

Cheers.


On Mon, May 2, 2016 at 1:53 AM, Indra Pramana <indra@sg.or.id> wrote:

> Dear all,
>
> We are using CloudStack 4.2.0, the KVM hypervisor and Ceph RBD storage. We
> have been having a specific problem for quite some time (possibly since the
> first day we used CloudStack), which we suspect is related to HA.
>
> When a CloudStack agent gets disconnected from the management server for
> any reason, CloudStack gradually marks some or all of the VMs on the
> disconnected host as "Stopped", even though they are actually still running
> on the disconnected host. When I try to reconnect the agent, CloudStack
> seems to instruct the agent to stop those VMs first, and stays busy shutting
> them down one by one while in the "Connecting" state, before it reaches the
> "Up" state.
>
> This causes all the VMs on the host (with the disconnected agent) to go
> down unnecessarily, even though technically they could stay up while the
> agent reconnects to the management server.
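>
> As a sanity check while the agent is disconnected, a quick libvirt query on
> the host itself confirms the guests are still up, independent of the state
> CloudStack has recorded -- a rough sketch, assuming the libvirt-python
> bindings are installed on the host:
>
> =====
> import libvirt  # assumes the libvirt-python bindings are installed
>
> # Connect to the local hypervisor and list the domains libvirt reports as
> # running, regardless of what state CloudStack has recorded for them.
> conn = libvirt.open("qemu:///system")
> for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
>     print(dom.name(), "is active")
> conn.close()
> =====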
>
> Is there a way we can prevent CloudStack from shutting down the VMs during
> agent re-connection? Relevant logs from the management server and agent are
> below; it seems HA is the culprit.
>
> Any advice is appreciated.
>
> Excerpts from the management server logs -- in the example below, the
> hostname of the affected VM on the disconnected host is "vm-hostname", and
> what follows is the result of grepping "vm-hostname" from the logs:
>
> ====
> 2016-04-30 23:24:32,680 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (Timer-1:null) Schedule vm for HA:  VM[User|vm-hostname]
> 2016-04-30 23:24:35,565 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-1:work-11007) HA on VM[User|vm-hostname]
> 2016-04-30 23:24:35,571 DEBUG [cloud.ha.CheckOnAgentInvestigator]
> (HA-Worker-1:work-11007) Unable to reach the agent for
> VM[User|vm-hostname]: Resource [Host:34] is unreachable: Host 34: Host with
> specified id is not in the right state: Disconnected
> 2016-04-30 23:24:35,571 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-1:work-11007) SimpleInvestigator found VM[User|vm-hostname]to be
> alive? null
> 2016-04-30 23:24:35,571 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-1:work-11007) XenServerInvestigator found VM[User|vm-hostname]to
> be alive? null
> 2016-04-30 23:24:35,571 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (HA-Worker-1:work-11007) testing if VM[User|vm-hostname] is alive
> 2016-04-30 23:24:35,581 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (HA-Worker-1:work-11007) VM[User|vm-hostname] could not be pinged,
> returning that it is unknown
> 2016-04-30 23:24:35,581 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (HA-Worker-1:work-11007) Returning null since we're unable to determine
> state of VM[User|vm-hostname]
> 2016-04-30 23:24:35,581 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-1:work-11007) null found VM[User|vm-hostname]to be alive? null
> 2016-04-30 23:24:35,582 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator]
> (HA-Worker-1:work-11007) Not a System Vm, unable to determine state of
> VM[User|vm-hostname] returning null
> 2016-04-30 23:24:35,582 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator]
> (HA-Worker-1:work-11007) Testing if VM[User|vm-hostname] is alive
> 2016-04-30 23:24:35,586 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator]
> (HA-Worker-1:work-11007) Unable to find a management nic, cannot ping this
> system VM, unable to determine state of VM[User|vm-hostname] returning null
> 2016-04-30 23:24:35,586 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-1:work-11007) null found VM[User|vm-hostname]to be alive? null
> 2016-04-30 23:24:35,588 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-1:work-11007) KVMInvestigator found VM[User|vm-hostname]to be
> alive? null
> 2016-04-30 23:24:35,592 DEBUG [cloud.ha.KVMFencer]
> (HA-Worker-1:work-11007) Unable to fence off VM[User|vm-hostname] on
> Host[-34-Routing]
> 2016-04-30 23:24:35,592 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-1:work-11007) We were unable to fence off the VM
> VM[User|vm-hostname]
> 2016-04-30 23:24:35,592 WARN  [apache.cloudstack.alerts]
> (HA-Worker-1:work-11007)  alertType:: 8 // dataCenterId:: 6 // podId:: 6 //
> clusterId:: null // message:: Unable to restart vm-hostname which was
> running on host name: hypervisor-host(id:34), availability zone:
> xxxxxxxxxx-Singapore-01, pod: xxxxxxxxxx-Singapore-Pod-01
> 2016-04-30 23:24:41,028 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> (AgentConnectTaskPool-4:null) Both states are Running for
> VM[User|vm-hostname]
> =====
>
> The above keeps looping until the CloudStack management server eventually
> decides to do a force stop, as follows:
>
> =====
> 2016-05-01 00:30:23,305 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-3:work-11249) HA on VM[User|vm-hostname]
> 2016-05-01 00:30:23,311 DEBUG [cloud.ha.CheckOnAgentInvestigator]
> (HA-Worker-3:work-11249) Unable to reach the agent for
> VM[User|vm-hostname]: Resource [Host:34] is unreachable: Host 34: Host with
> specified id is not in the right state: Disconnected
> 2016-05-01 00:30:23,311 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-3:work-11249) SimpleInvestigator found VM[User|vm-hostname]to be
> alive? null
> 2016-05-01 00:30:23,311 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-3:work-11249) XenServerInvestigator found VM[User|vm-hostname]to
> be alive? null
> 2016-05-01 00:30:23,311 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (HA-Worker-3:work-11249) testing if VM[User|vm-hostname] is alive
> 2016-05-01 00:30:35,499 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (HA-Worker-3:work-11249) VM[User|vm-hostname] could not be pinged,
> returning that it is unknown
> 2016-05-01 00:30:35,499 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (HA-Worker-3:work-11249) Returning null since we're unable to determine
> state of VM[User|vm-hostname]
> 2016-05-01 00:30:35,499 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-3:work-11249) null found VM[User|vm-hostname]to be alive? null
> 2016-05-01 00:30:35,499 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator]
> (HA-Worker-3:work-11249) Not a System Vm, unable to determine state of
> VM[User|vm-hostname] returning null
> 2016-05-01 00:30:35,499 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator]
> (HA-Worker-3:work-11249) Testing if VM[User|vm-hostname] is alive
> 2016-05-01 00:30:35,505 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator]
> (HA-Worker-3:work-11249) Unable to find a management nic, cannot ping this
> system VM, unable to determine state of VM[User|vm-hostname] returning null
> 2016-05-01 00:30:35,505 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-3:work-11249) null found VM[User|vm-hostname]to be alive? null
> 2016-05-01 00:30:35,558 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-3:work-11249) KVMInvestigator found VM[User|vm-hostname]to be
> alive? null
> 2016-05-01 00:30:35,688 WARN  [cloud.vm.VirtualMachineManagerImpl]
> (HA-Worker-3:work-11249) Unable to actually stop VM[User|vm-hostname] but
> continue with release because it's a force stop
> 2016-05-01 00:30:35,693 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> (HA-Worker-3:work-11249) VM[User|vm-hostname] is stopped on the host.
> Proceeding to release resource held.
> 2016-05-01 00:30:35,698 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> (HA-Worker-3:work-11249) Successfully released network resources for the vm
> VM[User|vm-hostname]
> 2016-05-01 00:30:35,698 DEBUG [cloud.vm.VirtualMachineManagerImpl]
> (HA-Worker-3:work-11249) Successfully released storage resources for the vm
> VM[User|vm-hostname]
> 2016-05-01 00:31:38,426 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-3:work-11183) HA on VM[User|vm-hostname]
> 2016-05-01 00:31:38,426 INFO  [cloud.ha.HighAvailabilityManagerImpl]
> (HA-Worker-3:work-11183) VM VM[User|vm-hostname] has been changed.  Current
> State = Stopped Previous State = Running last updated = 113 previous
> updated = 111
> =====
>
> Below are excerpts from the corresponding agent.log; note that
> i-1082-3086-VM is the instance name of the example vm-hostname VM above:
>
> =====
> 2016-04-30 23:24:36,592 DEBUG [kvm.resource.LibvirtComputingResource]
> (Agent-Handler-1:null) Detecting a new state but couldn't find a old state
> so adding it to the changes: i-1082-3086-VM
> =====
>
> After the CloudStack management server decides to mark the VM as stopped,
> the agent tries to shut the VM down upon reconnecting to the management
> server:
>
> =====
> 2016-05-01 00:32:32,029 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-3:null) Processing command:
> com.cloud.agent.api.StopCommand
> 2016-05-01 00:32:32,063 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-3:null) Executing:
> /usr/share/cloudstack-common/scripts/vm/network/security_group.py
> destroy_network_rules_for_vm --vmname i-1082-3086-VM --vif vnet11
> 2016-05-01 00:32:32,195 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-3:null) Execution is successful.
> 2016-05-01 00:32:32,196 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-3:null) Try to stop the vm at first
> =====
>
> and
>
> =====
> 2016-05-01 00:33:04,835 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-3:null) successfully shut down vm i-1082-3086-VM
> 2016-05-01 00:33:04,836 DEBUG [utils.script.Script]
> (agentRequest-Handler-3:null) Executing: /bin/bash -c ls
> /sys/class/net/breth1-8/brif | grep vnet
> 2016-05-01 00:33:04,847 DEBUG [utils.script.Script]
> (agentRequest-Handler-3:null) Execution is successful.
> =====
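>
> Before reconnecting an agent, it might be worth listing which VMs CloudStack
> already considers Stopped and comparing them against "virsh list" on the
> host, so you know which guests the reconnect will shut down -- a rough,
> untested sketch, again assuming the unauthenticated integration API port is
> enabled on the management server:
>
> =====
> import requests  # assumes the 'requests' library is installed
>
> MGMT = "http://mgmt-server:8096/client/api"  # hypothetical management server address
>
> # List every VM the management server currently records as Stopped; any of
> # these that is still running under libvirt will be stopped on reconnect.
> resp = requests.get(MGMT, params={
>     "command": "listVirtualMachines",
>     "state": "Stopped",
>     "listall": "true",
>     "response": "json",
> }).json()
> for vm in resp["listvirtualmachinesresponse"].get("virtualmachine", []):
>     print(vm["name"], vm["state"])
> =====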
>
> - Is there a way for us to prevent the above scenario from happening?
> - Is disabling HA on the VM the only way to prevent it?
> - I understand that disabling HA requires applying a new service offering to
> each VM and restarting the VM for the change to take effect (a sketch
> follows this list). Is there a way to disable HA globally without changing
> the service offering of each VM?
> - Is it possible to avoid the above scenario without disabling HA and
> losing the HA features and functionality?
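>
> For reference, a rough sketch of the per-VM route -- it assumes a service
> offering with HA disabled already exists, that changeServiceForVirtualMachine
> requires the VM to be in the Stopped state, and once more uses the
> unauthenticated integration API port. The ids are placeholders:
>
> =====
> import requests  # assumes the 'requests' library is installed
>
> MGMT = "http://mgmt-server:8096/client/api"  # hypothetical management server address
> VM_ID = "vm-uuid-here"         # hypothetical VM id
> OFFERING_ID = "offering-uuid"  # hypothetical id of an offering with HA disabled
>
> # Stop the VM first; changeServiceForVirtualMachine only works on a Stopped VM.
> # (stopVirtualMachine is asynchronous -- a real script would poll
> # queryAsyncJobResult until the job finishes before issuing the next call.)
> requests.get(MGMT, params={"command": "stopVirtualMachine",
>                            "id": VM_ID, "response": "json"})
>
> # Move the VM onto the non-HA offering, then start it again.
> requests.get(MGMT, params={"command": "changeServiceForVirtualMachine",
>                            "id": VM_ID, "serviceofferingid": OFFERING_ID,
>                            "response": "json"})
> requests.get(MGMT, params={"command": "startVirtualMachine",
>                            "id": VM_ID, "response": "json"})
> =====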
>
> Any advice is greatly appreciated.
>
> Looking forward to your reply, thank you.
>
> Cheers.
>
