cloudstack-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLOUDSTACK-8848) Unexpected VR reboot after out-of-band migration
Date Wed, 16 Sep 2015 08:15:45 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747118#comment-14747118 ]

ASF GitHub Bot commented on CLOUDSTACK-8848:
--------------------------------------------

Github user anshul1886 commented on the pull request:

    https://github.com/apache/cloudstack/pull/829#issuecomment-140665483
  
    Currently we report only PowerOn VMs and do not report PowerOff VMs; that is why we treat
Missing and PowerOff as the same state. The whole codebase is written around that assumption,
and every hypervisor resource shares the same understanding. Changing it would affect HA and
many other, less obvious places, so please do not consider merging this change.
    
    Coming to the bug itself: we can fix it by changing the global setting pingInterval to a
value appropriate for the hypervisor, which covers the transitional period of the missing
report, or it can be handled by introducing a gracePeriod global setting.
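The suggested mitigation can be sketched as follows. This is a hypothetical illustration, not CloudStack's actual code: the class name, method, and the gracePeriod setting are assumptions; only the idea, namely treating a missing report as a power-off only after the ping interval plus a grace period has elapsed, comes from the comment above.

```java
// Hypothetical sketch of the proposed grace-period check. A VM that is
// absent from a host's power-state report is only treated as powered off
// once the time since its last state update exceeds the ping interval
// plus a configurable grace period, bridging the window during an
// out-of-band migration when neither host reports the VM.
public class MissingReportGracePeriod {
    private final long pingIntervalMs; // the pingInterval global setting
    private final long gracePeriodMs;  // hypothetical gracePeriod global setting

    public MissingReportGracePeriod(long pingIntervalMs, long gracePeriodMs) {
        this.pingIntervalMs = pingIntervalMs;
        this.gracePeriodMs = gracePeriodMs;
    }

    /** True only if the missing report has outlived the whole grace window. */
    public boolean shouldMarkPowerOff(long nowMs, long lastStateUpdateMs) {
        return (nowMs - lastStateUpdateMs) > (pingIntervalMs + gracePeriodMs);
    }
}
```

With, say, a 60 s ping interval and a 120 s grace period, the 120508 ms gap seen in the VR log below would no longer trigger a power-off.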


> Unexpected VR reboot after out-of-band migration
> ------------------------------------------------
>
>                 Key: CLOUDSTACK-8848
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8848
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: VMware
>    Affects Versions: 4.5.2, 4.6.0
>            Reporter: René Moser
>            Priority: Blocker
>             Fix For: 4.5.3, 4.6.0
>
>
> Under certain conditions (a race condition), the VR gets rebooted after an out-of-band migration
was done on vCenter. 
> {panel:bgColor=#FFFFCE}
> Note: the new global setting in 4.5.2, "VR reboot after out of band migration", is set to *false*,
so this looks more like a bug.
> {panel}
> After a VR migration to a host, _and_ while the VM power state report gathering is running,
the VR (and likewise any user VM) will get into the "PowerReportMissing" state.
> If the VM is a VR, it will be powered off and started again on vCenter. That is what
we see. It cannot be reproduced on every migration, but the problem seems to be related
to "PowerReportMissing".
> I grepped the source and found this related line:
> https://github.com/apache/cloudstack/blob/4.5.2/engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java#L3616
> The graceful period also seems to be related:
> https://github.com/apache/cloudstack/blob/4.5.2/engine/orchestration/src/com/cloud/vm/VirtualMachinePowerStateSyncImpl.java#L110
> If it is a user VM, the logs show that its state is set to powered-off, but the VM
keeps running. After a while a new VM power state report runs and the user VM's state
is updated to Running again. The logs are below.
> h2. VR  r-342-VM reboot log
> {code:none}
> 2015-09-15 09:37:06,508 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-253:ctx-c4f59216)
Run missing VM report. current time: 1442302626508
> 2015-09-15 09:37:06,508 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-253:ctx-c4f59216)
Detected missing VM. host: 19, vm id: 342, power state: PowerReportMissing, last state update:
1442302506000
> 2015-09-15 09:37:06,508 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-253:ctx-c4f59216)
vm id: 342 - time since last state update(120508ms) has passed graceful period
> 2015-09-15 09:37:06,517 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-253:ctx-c4f59216)
VM state report is updated. host: 19, vm id: 342, power state: PowerReportMissing 
> 2015-09-15 09:37:06,525 INFO  [c.c.v.VirtualMachineManagerImpl] (DirectAgentCronJob-253:ctx-c4f59216)
VM r-342-VM is at Running and we received a power-off report while there is no pending jobs
on it
> 2015-09-15 09:37:06,532 DEBUG [c.c.a.t.Request] (DirectAgentCronJob-253:ctx-c4f59216)
Seq 19-4511199451741686482: Sending  { Cmd , MgmtId: 345051122106, via: 19(cu01-testpod01-esx03.stxt.media.int),
Ver: v1, Flags: 100011, [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":false,"checkBeforeCleanup":true,"vmName":"r-342-VM","wait":0}}]
}
> 2015-09-15 09:37:06,532 DEBUG [c.c.a.t.Request] (DirectAgentCronJob-253:ctx-c4f59216)
Seq 19-4511199451741686482: Executing:  { Cmd , MgmtId: 345051122106, via: 19(cu01-testpod01-esx03.stxt.media.int),
Ver: v1, Flags: 100011, [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":false,"checkBeforeCleanup":true,"vmName":"r-342-VM","wait":0}}]
}
> 2015-09-15 09:37:06,532 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgent-136:ctx-9bc0a401)
Seq 19-4511199451741686482: Executing request
> 2015-09-15 09:37:06,532 INFO  [c.c.h.v.r.VmwareResource] (DirectAgent-136:ctx-9bc0a401
cu01-testpod01-esx03.stxt.media.int, cmd: StopCommand) Executing resource StopCommand: {"isProxy":false,"executeInSequence":false,"checkBeforeCleanup":true,"vmName":"r-342-VM","wait":0}
> 2015-09-15 09:37:06,551 DEBUG [c.c.h.v.m.HostMO] (DirectAgent-136:ctx-9bc0a401 cu01-testpod01-esx03.stxt.media.int,
cmd: StopCommand) find VM r-342-VM on host
> 2015-09-15 09:37:06,551 INFO  [c.c.h.v.m.HostMO] (DirectAgent-136:ctx-9bc0a401 cu01-testpod01-esx03.stxt.media.int,
cmd: StopCommand) VM r-342-VM not found in host cache
> 2015-09-15 09:37:06,551 DEBUG [c.c.h.v.m.HostMO] (DirectAgent-136:ctx-9bc0a401 cu01-testpod01-esx03.stxt.media.int,
cmd: StopCommand) load VM cache on host
> 2015-09-15 09:37:06,577 INFO  [c.c.h.v.r.VmwareResource] (DirectAgent-136:ctx-9bc0a401
cu01-testpod01-esx03.stxt.media.int, cmd: StopCommand) VM r-342-VM is no longer in vSphere
> 2015-09-15 09:37:06,578 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgent-136:ctx-9bc0a401)
Seq 19-4511199451741686482: Response Received: 
> 2015-09-15 09:37:06,578 DEBUG [c.c.a.t.Request] (DirectAgent-136:ctx-9bc0a401) Seq 19-4511199451741686482:
Processing:  { Ans: , MgmtId: 345051122106, via: 19, Ver: v1, Flags: 10, [{"com.cloud.agent.api.StopAnswer":{"result":true,"details":"VM
r-342-VM is no longer in vSphere","wait":0}}] }
> 2015-09-15 09:37:06,578 DEBUG [c.c.a.t.Request] (DirectAgentCronJob-253:ctx-c4f59216)
Seq 19-4511199451741686482: Received:  { Ans: , MgmtId: 345051122106, via: 19, Ver: v1, Flags:
10, { StopAnswer } }
> 2015-09-15 09:37:06,585 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (DirectAgentCronJob-253:ctx-c4f59216)
Successfully updated user statistics as a part of domR VM[DomainRouter|r-342-VM] reboot/stop
> 2015-09-15 09:37:06,617 DEBUG [c.c.c.CapacityManagerImpl] (DirectAgentCronJob-253:ctx-c4f59216)
VM state transitted from :Running to Stopped with event: FollowAgentPowerOffReportvm's original
host id: 20 new host id: null host id before state transition: 19
> 2015-09-15 09:37:06,627 WARN  [o.a.c.alerts] (DirectAgentCronJob-253:ctx-c4f59216)  alertType::
27 // dataCenterId:: 1 // podId:: 1 // clusterId:: null // message:: VM state sync alert
> 2015-09-15 09:37:06,630 DEBUG [c.c.a.AlertManagerImpl] (DirectAgentCronJob-253:ctx-c4f59216)
Have already sent: 1 emails for alert type '27' -- skipping send email
> 2015-09-15 09:37:06,630 INFO  [c.c.v.VirtualMachineManagerImpl] (DirectAgentCronJob-253:ctx-c4f59216)
VM r-342-VM is sync-ed to at Stopped state according to power-off report from hypervisor
> 2015-09-15 09:37:06,630 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-253:ctx-c4f59216)
Done with process of VM state report. host: 19
> 2015-09-15 09:37:07,506 DEBUG [c.c.c.ConsoleProxyManagerImpl] (consoleproxy-1:ctx-90d7ef3a)
Zone 1 is ready to launch console proxy
> 2015-09-15 09:37:07,524 INFO  [c.c.h.v.r.VmwareResource] (DirectAgentCronJob-64:ctx-47298215)
Scan hung worker VM to recycle
> 2015-09-15 09:37:07,611 INFO  [c.c.h.v.r.VmwareResource] (DirectAgentCronJob-254:ctx-757ec6a9)
Scan hung worker VM to recycle
> 2015-09-15 09:37:07,722 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgentCronJob-64:ctx-47298215)
Ping from 25(cu01-testpod01-esx02.stxt.media.int)
> 2015-09-15 09:37:07,722 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-64:ctx-47298215)
Process host VM state report from ping process. host: 25
> 2015-09-15 09:37:07,722 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-64:ctx-47298215)
Process VM state report. host: 25, number of records in report: 0
> 2015-09-15 09:37:07,723 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-64:ctx-47298215)
Done with process of VM state report. host: 25
> 2015-09-15 09:37:07,732 DEBUG [o.a.c.s.SecondaryStorageManagerImpl] (secstorage-1:ctx-d63339a0)
Zone 1 is ready to launch secondary storage VM
> 2015-09-15 09:37:07,774 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgentCronJob-254:ctx-757ec6a9)
Ping from 20(cu01-testpod01-esx04.stxt.media.int)
> 2015-09-15 09:37:07,774 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-254:ctx-757ec6a9)
Process host VM state report from ping process. host: 20
> 2015-09-15 09:37:07,806 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-254:ctx-757ec6a9)
Process VM state report. host: 20, number of records in report: 11
> 2015-09-15 09:37:07,806 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-254:ctx-757ec6a9)
VM state report. host: 20, vm id: 356, power state: PowerOn
> 2015-09-15 09:37:07,809 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-254:ctx-757ec6a9)
VM power state does not change, skip DB writing. vm id: 356
> 2015-09-15 09:37:07,809 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-254:ctx-757ec6a9)
VM state report. host: 20, vm id: 342, power state: PowerOn
> 2015-09-15 09:37:07,825 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-254:ctx-757ec6a9)
VM state report is updated. host: 20, vm id: 342, power state: PowerOn
> 2015-09-15 09:37:07,832 INFO  [c.c.v.VirtualMachineManagerImpl] (DirectAgentCronJob-254:ctx-757ec6a9)
VM r-342-VM is at Stopped and we received a power-on report while there is no pending jobs
on it
> 2015-09-15 09:37:07,845 INFO  [c.c.n.r.VirtualNetworkApplianceManagerImpl] (DirectAgentCronJob-254:ctx-757ec6a9)
Schedule a router reboot task as router 342 is powered-on out-of-band. we need to reboot to
refresh network rules
> 2015-09-15 09:37:07,846 INFO  [c.c.n.r.VirtualNetworkApplianceManagerImpl] (RouterMonitor-1:ctx-c118d628)
Reboot router 342 to refresh network rules
> 2015-09-15 09:37:07,847 DEBUG [c.c.c.CapacityManagerImpl] (DirectAgentCronJob-254:ctx-757ec6a9)
VM state transitted from :Stopped to Running with event: FollowAgentPowerOnReportvm's original
host id: 20 new host id: 20 host id before state transition: null
> {code}
> h2. user vm log for  i-2-355-VM 
> {code:none}
> 2015-09-14 17:35:06,525 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-36:ctx-08ae2caf)
Run missing VM report. current time: 1442244906525
> 2015-09-14 17:35:06,525 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-36:ctx-08ae2caf)
Detected missing VM. host: 19, vm id: 355, power state: PowerReportMissing, last state update:
1441982337000
> 2015-09-14 17:35:06,525 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-36:ctx-08ae2caf)
vm id: 355 - time since last state update(262569525ms) has passed graceful period
> 2015-09-14 17:35:06,534 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-36:ctx-08ae2caf)
VM state report is updated. host: 19, vm id: 355, power state: PowerReportMissing 
> 2015-09-14 17:35:06,540 INFO  [c.c.v.VirtualMachineManagerImpl] (DirectAgentCronJob-36:ctx-08ae2caf)
VM i-2-355-VM is at Running and we received a power-off report while there is no pending jobs
on it
> 2015-09-14 17:35:06,546 DEBUG [c.c.a.t.Request] (DirectAgentCronJob-36:ctx-08ae2caf)
Seq 19-4511199451741683988: Sending  { Cmd , MgmtId: 345051122106, via: 19(cu01-testpod01-esx03.stxt.media.int),
Ver: v1, Flags: 100011, [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":false,"checkBeforeCleanup":true,"vmName":"i-2-355-VM","wait":0}}]
}
> 2015-09-14 17:35:06,546 DEBUG [c.c.a.t.Request] (DirectAgentCronJob-36:ctx-08ae2caf)
Seq 19-4511199451741683988: Executing:  { Cmd , MgmtId: 345051122106, via: 19(cu01-testpod01-esx03.stxt.media.int),
Ver: v1, Flags: 100011, [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":false,"checkBeforeCleanup":true,"vmName":"i-2-355-VM","wait":0}}]
}
> 2015-09-14 17:35:06,546 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgent-424:ctx-4793889d)
Seq 19-4511199451741683988: Executing request
> 2015-09-14 17:35:06,547 INFO  [c.c.h.v.r.VmwareResource] (DirectAgent-424:ctx-4793889d
cu01-testpod01-esx03.stxt.media.int, cmd: StopCommand) Executing resource StopCommand: {"isProxy":false,"executeInSequence":false,"checkBeforeCleanup":true,"vmName":"i-2-355-VM","wait":0}
> 2015-09-14 17:35:06,565 DEBUG [c.c.h.v.m.HostMO] (DirectAgent-424:ctx-4793889d cu01-testpod01-esx03.stxt.media.int,
cmd: StopCommand) find VM i-2-355-VM on host
> 2015-09-14 17:35:06,565 INFO  [c.c.h.v.m.HostMO] (DirectAgent-424:ctx-4793889d cu01-testpod01-esx03.stxt.media.int,
cmd: StopCommand) VM i-2-355-VM not found in host cache
> 2015-09-14 17:35:06,565 DEBUG [c.c.h.v.m.HostMO] (DirectAgent-424:ctx-4793889d cu01-testpod01-esx03.stxt.media.int,
cmd: StopCommand) load VM cache on host
> 2015-09-14 17:35:06,595 INFO  [c.c.h.v.r.VmwareResource] (DirectAgent-424:ctx-4793889d
cu01-testpod01-esx03.stxt.media.int, cmd: StopCommand) VM i-2-355-VM is no longer in vSphere
> 2015-09-14 17:35:06,595 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgent-424:ctx-4793889d)
Seq 19-4511199451741683988: Response Received: 
> 2015-09-14 17:35:06,595 DEBUG [c.c.a.t.Request] (DirectAgent-424:ctx-4793889d) Seq 19-4511199451741683988:
Processing:  { Ans: , MgmtId: 345051122106, via: 19, Ver: v1, Flags: 10, [{"com.cloud.agent.api.StopAnswer":{"result":true,"details":"VM
i-2-355-VM is no longer in vSphere","wait":0}}] }
> 2015-09-14 17:35:06,595 DEBUG [c.c.a.t.Request] (DirectAgentCronJob-36:ctx-08ae2caf)
Seq 19-4511199451741683988: Received:  { Ans: , MgmtId: 345051122106, via: 19, Ver: v1, Flags:
10, { StopAnswer } }
> 2015-09-14 17:35:06,666 DEBUG [c.c.n.NetworkModelImpl] (DirectAgentCronJob-36:ctx-08ae2caf)
Service SecurityGroup is not supported in the network id=211
> 2015-09-14 17:35:06,672 DEBUG [c.c.n.NetworkModelImpl] (DirectAgentCronJob-36:ctx-08ae2caf)
Service SecurityGroup is not supported in the network id=211
> 2015-09-14 17:35:06,672 DEBUG [c.c.c.CapacityManagerImpl] (DirectAgentCronJob-36:ctx-08ae2caf)
VM state transitted from :Running to Stopped with event: FollowAgentPowerOffReportvm's original
host id: 19 new host id: null host id before state transition: 19
> 2015-09-14 17:35:06,689 WARN  [o.a.c.alerts] (DirectAgentCronJob-36:ctx-08ae2caf)  alertType::
27 // dataCenterId:: 1 // podId:: 1 // clusterId:: null // message:: VM state sync alert
> 2015-09-14 17:35:06,692 DEBUG [c.c.a.AlertManagerImpl] (DirectAgentCronJob-36:ctx-08ae2caf)
Have already sent: 1 emails for alert type '27' -- skipping send email
> 2015-09-14 17:35:06,692 INFO  [c.c.v.VirtualMachineManagerImpl] (DirectAgentCronJob-36:ctx-08ae2caf)
VM i-2-355-VM is sync-ed to at Stopped state according to power-off report from hypervisor
> 2015-09-14 17:35:06,692 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-36:ctx-08ae2caf)
Done with process of VM state report. host: 19
> ...
> 2015-09-14 17:35:07,754 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-137:ctx-80d2695e)
VM state report is updated. host: 20, vm id: 355, power state: PowerOn
> 2015-09-14 17:35:07,762 INFO  [c.c.v.VirtualMachineManagerImpl] (DirectAgentCronJob-137:ctx-80d2695e)
VM i-2-355-VM is at Stopped and we received a power-on report while there is no pending jobs
on it
> 2015-09-14 17:35:07,793 DEBUG [c.c.c.CapacityManagerImpl] (DirectAgentCronJob-137:ctx-80d2695e)
VM state transitted from :Stopped to Running with event: FollowAgentPowerOnReportvm's original
host id: 19 new host id: 20 host id before state transition: null
> {code}
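Taken together, the two quoted logs show the same two-step race: the source host's report loses the migrated VM, CloudStack follows the power-off report to Stopped, and a second later the destination host's power-on report flips it back to Running, which for a router also schedules the unexpected reboot. A hypothetical sketch of that sequence follows; these are not CloudStack's actual classes, and the state and event names merely mirror the log messages above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the race in the quoted logs: a missing report from
// the source host, then a power-on report from the destination host.
public class PowerSyncRace {
    enum State { RUNNING, STOPPED }

    State state = State.RUNNING;
    final List<String> events = new ArrayList<>();

    // Report from the source host: VM past graceful period -> follow power-off.
    void onPowerReportMissing() {
        if (state == State.RUNNING) {
            state = State.STOPPED;
            events.add("FollowAgentPowerOffReport");
        }
    }

    // Report from the destination host: VM seen as PowerOn -> follow power-on;
    // a router additionally gets a reboot scheduled to refresh network rules.
    void onPowerOnReport(boolean isRouter) {
        if (state == State.STOPPED) {
            state = State.RUNNING;
            events.add("FollowAgentPowerOnReport");
            if (isRouter) {
                events.add("ScheduleRouterReboot"); // the unexpected VR reboot
            }
        }
    }
}
```

For a user VM (`isRouter == false`) the state merely bounces Stopped/Running, matching the i-2-355-VM log; for a VR the final event is the reboot the issue reports.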



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
