cloudstack-dev mailing list archives

From marcaurele <...@git.apache.org>
Subject [GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...
Date Wed, 17 Aug 2016 15:14:17 GMT
Github user marcaurele commented on the issue:

    https://github.com/apache/cloudstack/pull/1640
  
    Here is the long explanation of the bug we had in production.
    
    Twice, the VPN between two of our zones went down, which resulted in the loss of communication between the management server and all the agents of the other zone. After a few minutes the network came back, and as the agents were reconnecting we started to see VMs being shut down, even though we don't have HA enabled. We run 2 management servers behind haproxy, which balances agents & requests. When the network came back, the agents reconnected to the other management server (srv01), while srv02 had already started to issue commands to shut down the VMs.
    
    The important lines are below; I kept the lines showing one VM, `111750`, being shut down:
    
    ```
    srv02 2016-08-08 11:56:03,854 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) PingInvestigator was able to determine host 44692 is in Disconnected
    srv02 2016-08-08 11:56:03,855 WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Agent is disconnected but the host is still up: 44692-virt-hv041
    srv02 2016-08-08 11:56:03,857 WARN  [apache.cloudstack.alerts] (AgentTaskPool-16:ctx-8b5b6956)  alertType:: 7 // dataCenterId:: 2 // podId:: 2 // clusterId:: null // message:: Host disconnected, name: virt-hv041 (id:44692), availability zone: ch-dk-2, pod: DK2-AZ1-POD01
    srv02 2016-08-08 11:56:03,884 INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Host 44692 is disconnecting with event AgentDisconnected
    srv02 2016-08-08 11:56:03,895 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) The next status of agent 44692is Alert, current status is Up
    srv02 2016-08-08 11:56:03,896 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Deregistering link for 44692 with state Alert
    srv02 2016-08-08 11:56:03,896 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Remove Agent : 44692
    srv02 2016-08-08 11:56:03,897 DEBUG [agent.manager.AgentAttache] (AgentTaskPool-16:ctx-8b5b6956) Seq 44692-4932286016900831481: Sending disconnect to class com.cloud.network.security.SecurityGroupListener
    srv02 2016-08-08 11:56:03,897 DEBUG [agent.manager.AgentAttache] (AgentTaskPool-16:ctx-8b5b6956) Seq 44692-4932286016900834802: Sending disconnect to class com.cloud.network.security.SecurityGroupListener
    srv02 2016-08-08 11:56:03,897 DEBUG [agent.manager.AgentAttache] (AgentTaskPool-16:ctx-8b5b6956) Seq 44692-4932286016900886805: Sending disconnect to class com.cloud.network.security.SecurityGroupListener
    ...
    srv02 2016-08-08 11:56:03,979 DEBUG [cloud.network.NetworkUsageManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Disconnected called on 44692 with status Alert
    srv02 2016-08-08 11:56:03,987 DEBUG [cloud.host.Status] (AgentTaskPool-16:ctx-8b5b6956) Transition:[Resource state = Enabled, Agent event = AgentDisconnected, Host id = 44692, name = virt-hv041]
    srv02 2016-08-08 11:56:03,998 DEBUG [cloud.cluster.ClusterManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Forwarding [{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}] to 345049010805
    srv02 2016-08-08 11:56:03,998 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Worker-2:ctx-e087e71f) Cluster PDU 90520739220960 -> 345049010805. agent: 44692, pdu seq: 15, pdu ack seq: 0, json: [{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
    srv01 2016-08-08 11:56:04,002 DEBUG [agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) Dispatch ->44692, json: [{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
    srv01 2016-08-08 11:56:04,002 DEBUG [agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) Intercepting command for agent change: agent 44692 event: AgentDisconnected
    srv01 2016-08-08 11:56:04,002 DEBUG [agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) Received agent disconnect event for host 44692
    srv02 2016-08-08 11:56:04,002 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Worker-2:ctx-e087e71f) Cluster PDU 90520739220960 -> 345049010805 completed. time: 3ms. agent: 44692, pdu seq: 15, pdu ack seq: 0, json: [{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
    srv01 2016-08-08 11:56:04,004 DEBUG [agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) Not processing AgentDisconnected event for the host id=44692 as the host is directly connected to the current management server 345049010805
    srv02 2016-08-08 11:56:04,004 WARN  [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Scheduling restart for VMs on host 44692-virt-hv041
    ...
    srv02 2016-08-08 11:56:04,286 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Notifying HA Mgr of to restart vm 111750-i-8114-111750-VM
    srv02 2016-08-08 11:56:04,296 INFO  [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Schedule vm for HA:  VM[User|i-8114-111750-VM]
    ...
    srv01 2016-08-08 11:56:17,394 DEBUG [agent.manager.AgentManagerImpl] (StatsCollector-4:ctx-dcd63254) Can not send command com.cloud.agent.api.GetVmStatsCommand due to Host 44692 is not up
    srv02 2016-08-08 11:56:17,395 DEBUG [agent.manager.AgentManagerImpl] (StatsCollector-3:ctx-8cb3d7fb) Can not send command com.cloud.agent.api.GetVmStatsCommand due to Host 44692 is not up
    srv01 2016-08-08 11:56:37,825 DEBUG [agent.manager.AgentManagerImpl] (AgentManager-Handler-13:null) Ping from 44692
    srv01 2016-08-08 11:56:37,826 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (AgentManager-Handler-13:null) Process host VM state report from ping process. host: 44692
    srv01 2016-08-08 11:56:37,886 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (AgentManager-Handler-13:null) Process VM state report. host: 44692, number of records in report: 26
    srv01 2016-08-08 11:56:37,891 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (AgentManager-Handler-13:null) VM state report is updated. host: 44692, vm id: 81688, power state: PowerOn
    ...
    srv01 2016-08-08 11:57:38,581 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (AgentManager-Handler-1:null) VM state report is updated. host: 44692, vm id: 111750, power state: PowerOn
    ...
    srv01 2016-08-08 12:00:50,544 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f9c5c881 work-858) Processing HAWork[858-HA-111750-Running-Investigating]
    srv01 2016-08-08 12:00:50,555 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f9c5c881 work-858) HA on VM[User|i-8114-111750-VM]
    srv01 2016-08-08 12:00:50,565 DEBUG [agent.transport.Request] (HA-Worker-2:ctx-f9c5c881 work-858) Seq 44692-6380193297100439580: Sending  { Cmd , MgmtId: 345049010805, via: 44692(virt-hv041), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckVirtualMachineCommand":{"vmName":"i-8114-111750-VM","wait":20}}] }
    ...
    srv01 2016-08-08 12:00:50,658 DEBUG [agent.transport.Request] (HA-Worker-2:ctx-f9c5c881 work-858) Seq 44692-6380193297100439580: Received:  { Ans: , MgmtId: 345049010805, via: 44692, Ver: v1, Flags: 10, { CheckVirtualMachineAnswer } }
    srv01 2016-08-08 12:00:50,658 DEBUG [cloud.ha.CheckOnAgentInvestigator] (HA-Worker-2:ctx-f9c5c881 work-858) Agent responded with state Running
    srv01 2016-08-08 12:00:50,658 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f9c5c881 work-858) Rescheduling because the host is not up but the vm is alive
    srv01 2016-08-08 12:00:50,658 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f9c5c881 work-858) VM i-8114-111750-VM is found to be alive by SimpleInvestigator
    srv01 2016-08-08 12:00:50,658 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f9c5c881 work-858) Rescheduling HAWork[858-HA-111750-Running-Investigating] to try again at Mon Aug 08 12:01:51 CEST 2016
    srv01 2016-08-08 12:00:50,658 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f9c5c881 work-858) SimpleInvestigator found VM[User|i-8114-111750-VM]to be alive? true
    ...
    srv01 2016-08-08 12:02:35,374 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-ee61de22 work-858) Processing HAWork[858-HA-111750-Running-Investigating]
    srv01 2016-08-08 12:02:35,381 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-ee61de22 work-858) HA on VM[User|i-8114-111750-VM]
    ...
    srv01 2016-08-08 12:02:35,478 DEBUG [cloud.ha.CheckOnAgentInvestigator] (HA-Worker-4:ctx-ee61de22 work-858) Agent responded with state Running
    srv01 2016-08-08 12:02:35,478 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-ee61de22 work-858) Rescheduling because the host is not up but the vm is alive
    srv01 2016-08-08 12:02:35,478 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-ee61de22 work-858) VM i-8114-111750-VM is found to be alive by SimpleInvestigator
    srv01 2016-08-08 12:02:35,478 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-ee61de22 work-858) Rescheduling HAWork[858-HA-111750-Running-Investigating] to try again at Mon Aug 08 12:03:36 CEST 2016
    srv01 2016-08-08 12:02:35,478 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-ee61de22 work-858) SimpleInvestigator found VM[User|i-8114-111750-VM]to be alive? true
    ...
    srv02 2016-08-08 12:03:46,230 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) Processing HAWork[858-HA-111750-Running-Investigating]
    srv02 2016-08-08 12:03:46,237 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) HA on VM[User|i-8114-111750-VM]
    srv02 2016-08-08 12:03:46,246 DEBUG [cloud.ha.CheckOnAgentInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) Unable to reach the agent for VM[User|i-8114-111750-VM]: Resource [Host:44692] is unreachable: Host 44692: Host with specified id is not in the right state: Alert
    srv02 2016-08-08 12:03:46,246 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) testing if VM[User|i-8114-111750-VM] is alive
    srv02 2016-08-08 12:03:46,246 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) SimpleInvestigator found VM[User|i-8114-111750-VM]to be alive? null
    srv02 2016-08-08 12:03:46,246 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) XenServerInvestigator found VM[User|i-8114-111750-VM]to be alive? null
    srv02 2016-08-08 12:03:46,259 DEBUG [agent.manager.AgentManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) Host with id null doesn't exist
    srv02 2016-08-08 12:03:46,259 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) Not a System Vm, unable to determine state of VM[User|i-8114-111750-VM] returning null
    srv02 2016-08-08 12:03:46,259 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) Testing if VM[User|i-8114-111750-VM] is alive
    srv02 2016-08-08 12:03:46,259 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) Returning null since we're unable to determine state of VM[User|i-8114-111750-VM]
    srv02 2016-08-08 12:03:46,259 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) VM[User|i-8114-111750-VM] could not be pinged, returning that it is unknown
    srv02 2016-08-08 12:03:46,259 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) PingInvestigator found VM[User|i-8114-111750-VM]to be alive? null
    srv01 2016-08-08 12:03:46,261 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:ctx-e1a90562 work-859) HA on VM[User|i-10355-116755-VM]
    srv02 2016-08-08 12:03:46,264 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) Unable to find a management nic, cannot ping this system VM, unable to determine state of VM[User|i-8114-111750-VM] returning null
    srv02 2016-08-08 12:03:46,265 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) ManagementIPSysVMInvestigator found VM[User|i-8114-111750-VM]to be alive? null
    srv01 2016-08-08 12:03:46,266 DEBUG [agent.transport.Request] (HA-Worker-3:ctx-e1a90562 work-859) Seq 44680-8591460714139353168: Sending  { Cmd , MgmtId: 345049010805, via: 44680(virt-hv029), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckVirtualMachineCommand":{"vmName":"i-10355-116755-VM","wait":20}}] }
    srv02 2016-08-08 12:03:46,272 DEBUG [agent.transport.Request] (HA-Worker-2:ctx-11e5dfc9 work-858) Seq 44672-2138365398070947852: Sending  { Cmd , MgmtId: 90520739220960, via: 44672(virt-hv019), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"eea983e8-6156-4ddb-b5c5-e4c4efa9ff4b-LibvirtComputingResource","privateNetwork":{"ip":"....","netmask":"....","mac":"....","isSecurityGroupEnabled":false},"storageNetwork1":{"ip":"....","netmask":"....","mac":"....","isSecurityGroupEnabled":false}},"wait":20}}] }
    srv02 2016-08-08 12:03:46,327 DEBUG [agent.transport.Request] (AgentManager-Handler-3:null) Seq 44672-2138365398070947852: Processing:  { Ans: , MgmtId: 90520739220960, via: 44672, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":true,"wait":0}}] }
    srv02 2016-08-08 12:03:46,328 DEBUG [agent.transport.Request] (HA-Worker-2:ctx-11e5dfc9 work-858) Seq 44672-2138365398070947852: Received:  { Ans: , MgmtId: 90520739220960, via: 44672, Ver: v1, Flags: 10, { Answer } }
    srv02 2016-08-08 12:03:46,328 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) Fencing off VM that we don't know the state of
    srv02 2016-08-08 12:03:46,328 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) KVMInvestigator found VM[User|i-8114-111750-VM]to be alive? null
    srv01 2016-08-08 12:03:46,333 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:ctx-bc9b35f1) No need to calibrate cpu capacity, host:44687 usedCpu: 0 reservedCpu: 83524
    srv01 2016-08-08 12:03:46,333 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:ctx-bc9b35f1) No need to calibrate memory capacity, host:44687 usedMem: 0 reservedMem: 44023414784
    srv02 2016-08-08 12:03:46,333 DEBUG [agent.transport.Request] (HA-Worker-2:ctx-11e5dfc9 work-858) Seq 44672-2138365398070947853: Sending  { Cmd , MgmtId: 90520739220960, via: 44672(virt-hv019), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.FenceCommand":{"vmName":"i-8114-111750-VM","hostGuid":"eea983e8-6156-4ddb-b5c5-e4c4efa9ff4b-LibvirtComputingResource","hostIp":"....","inSeq":false,"wait":0}}] }
    
    ```
    
    The way I reproduced it in our lab is by dropping the traffic between one hypervisor and the management servers:
    ```
    # Drop all traffic between this hypervisor and the management server (10.16.13.83)
    sudo iptables -I OUTPUT -d 10.16.13.83 -j DROP
    sudo iptables -I INPUT -s 10.16.13.83 -j DROP
    ```
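
    To know when to move on to the next step, you can poll the host status from the management server. This is a minimal sketch, assuming the default `cloud` database schema (table `host`, column `status`); adjust the MySQL credentials and the host name to your setup:
    ```
    # Poll the host status every 5 seconds until it turns to Alert
    watch -n 5 "mysql -u cloud cloud -e \"SELECT id, name, status FROM host WHERE name = 'virt-hv041'\""
    ```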
    
    When you see the host changing its state to Alert, restart the agent so that it quickly sends a `StartupCommand`, then re-enable the traffic if you want to see the VMs being shut down:

    ```
    # Restart the agent so it sends a StartupCommand on reconnect,
    # then remove the DROP rules to restore connectivity
    sudo service cloudstack-agent restart
    sudo iptables -D OUTPUT -d 10.16.13.83 -j DROP
    sudo iptables -D INPUT -s 10.16.13.83 -j DROP
    ```
    
    During that time window (right after the host/agent turned to Alert) you'll start to see AsyncJobs appearing in the database to stop VMs, which is the behaviour I'm requesting a change for. If the agent stays disconnected long enough, those commands simply vanish since the host cannot be reached. But if the timing is just *bad*, the agent comes back online while the jobs are being created, and it will process some of those commands. Playing with the `ping.timeout` value would only shift the bug's time window, not close it.
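
    To observe those jobs being queued, you can watch the database; again a sketch against the default `cloud` schema (the `async_job` table; the exact `job_cmd` values differ across versions, e.g. `com.cloud.vm.VmWorkStop` on recent branches):
    ```
    # List the most recently created async jobs while the host is in Alert
    mysql -u cloud cloud -e "SELECT id, job_cmd, job_status, created FROM async_job ORDER BY created DESC LIMIT 20"
    ```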
    
    It's not easy to make it happen, but it is feasible.
    
    The problem could be handled differently, but I found this approach a good solution that avoids adding too much complexity to the HA worker tasks, such as re-checking the host state between investigations. IMO if the VMs are not HA-enabled, nothing should be done: **the fencing should not be started**; instead, the power-state report should update the VMs' state once the agent on the host is reachable again. Currently the fencing is done (and the restart is skipped, since a check for HA is performed beforehand).
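
    To verify which VMs are exposed, and to watch the power-state report reconcile them once the agent is back, something like this can be used; again a sketch assuming the default `cloud` schema (`vm_instance` carries `ha_enabled` and, on recent versions, `power_state`):
    ```
    # Check the HA flag and watch the power state get reconciled for VM 111750
    mysql -u cloud cloud -e "SELECT id, instance_name, state, ha_enabled, power_state FROM vm_instance WHERE id = 111750"
    ```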
    
    We are running a custom version based on 4.4.2 with many merges, but since I could see the bug is still present on master, I pushed this PR.
    
    Hope this helps you better understand the problem.

