Date: Wed, 9 Oct 2013 10:57:41 +0000 (UTC)
From: "Abhinandan Prateek (JIRA)"
To: cloudstack-issues@incubator.apache.org
Subject: [jira] [Updated] (CLOUDSTACK-4740) Some vSphere VMs are shutdown when ACS is restarted

     [ https://issues.apache.org/jira/browse/CLOUDSTACK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhinandan Prateek updated CLOUDSTACK-4740:
-------------------------------------------
    Fix Version/s:     (was: 4.2.1)
                       4.3.0

> Some vSphere VMs are shutdown when ACS is restarted
> ---------------------------------------------------
>
>                 Key: CLOUDSTACK-4740
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-4740
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public (Anyone can view this level - this is the default.)
>          Components: Management Server
>    Affects Versions: 4.1.0, 4.1.1, 4.2.0, Future
>        Environment: I'm running ACS 4.1.1 with vSphere 5.1
>            Reporter: ilya musayev
>            Priority: Critical
>             Labels: management, poweroff
>            Fix For: 4.3.0
>
>
> If the management server is restarted, then when it starts back up it checks the agentState of each vSphere VM, and if it does not get a proper response it marks the VM as stopped.
> As a result, some of my virtual instances were shut down.
> In analyzing this issue further, here are my findings and the errors seen in the log (excerpted after the sketch below).
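> A quick toy model of the failure mode, before the raw logs (illustrative only: FullSyncToy and everything in it are made-up names, not CloudStack code; the real logic is in compareState(), quoted further down):
>
>     import java.util.HashMap;
>     import java.util.Map;
>
>     public class FullSyncToy {
>         enum State { Running, Stopped }
>
>         public static void main(String[] args) {
>             // The CloudStack DB still says Running. The host report built
>             // during the post-restart full sync says Stopped, and the report
>             // cannot distinguish "really powered off" from "power state could
>             // not be read", so both cases look identical to the sync loop.
>             State dbState = State.Running;
>             Map<String, State> hostReport = new HashMap<>();
>             hostReport.put("i-2-262-acs-docs-fc17", State.Stopped);
>
>             if (dbState == State.Running
>                     && hostReport.get("i-2-262-acs-docs-fc17") == State.Stopped) {
>                 // This is the point where CloudStack sends a StopCommand and
>                 // schedules an HA restart, even if the report was wrong.
>                 System.out.println("StopCommand -> i-2-262-acs-docs-fc17");
>             }
>         }
>     }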
> 2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource] (AgentTaskPool-1:null) Detecting a new state but couldn't find a old state so adding it to the changes: i-2-262-acs-docs-fc17
> 2013-09-25 14:35:51,213 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq -1--1: Startup request from directly connected host: { Cmd , MgmtId: -1, via: -1, Ver: v1, Flags: 11, [{"cpus":16,"speed":2199,"memory":68683468800,"dom0MinMemory":0,"poolSync":false,"vms":{"i-8-270-CLOUD411":{"state":"Running"},"r-15-CLOUD41-OLD":{"state":"Stopped"},"v-260-CLOUD411":{"state":"Running"},"i-2-283-vmbld01l-ops-08":{"state":"Running"},"i-2-104-ACS41VM":{"state":"Running"},"--s-1-CLOUD41-OLD":{"state":"Running"},"i-27-280-CLOUD411":{"state":"Running"},"i-2-285-ossec01l-ops-08":{"state":"Running"},"i-2-262-acs-docs-fc17":{"state":"Stopped"},"i-24-265-test3":{"state":"Running"},"cloud01l-ops-08.portal.webmd.com":{"state":"Running"},"i-2-278-demo01t-ops-08":{"state":"Running"},"s-63-CLOUD411":{"state":"Running"},"r-66-CLOUD411":{"state":"Running"},"i-2-281-acs-appliance":{"state":"Running"}},"caps":"hvm","hypervisorType":"VMware","hostDetails":{"com.cloud.network.Networks.RouterPrivateIpStrategy":"DcGlobal","NativeHA":"true"},"hypervisorVersion":"5.0","type":"Routing","dataCenter":"2","pod":"2","cluster":"3","guid":"HostSystem:host-19716@vc00q-ops-08.portal.webmd.com","name":"vmha62d-ops-08.portal.webmd.com","version":"4.1.1-SNAPSHOT","privateIpAddress":"172.25.243.31","privateMacAddress":"68:b5:99:73:0b:c2","privateNetmask":"255.255.255.0","storageIpAddress":"172.25.243.31","storageNetmask":"255.255.255.0","storageMacAddress":"68:b5:99:73:0b:c2","wait":0},{"totalSize":0,"poolInfo":{"uuid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","host":"vmha62d-ops-08.portal.webmd.com","localPath":"","hostPath":"datastore-19718","poolType":"LVM","capacityBytes":141465485312,"availableBytes":140383354880},"resourceType":"STORAGE_POOL","hostDetails":{},"type":"Storage","dataCenter":"2","pod":"2","cluster":"3","guid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","name":"72c8aedb-58c4-4569-ac51-adc5af770bf6","wait":0}] }
> 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
> 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
> 2013-09-25 14:35:53,614 INFO  [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) Skip HA for VMware VM i-2-262-acs-docs-fc17
> 2013-09-25 14:35:53,694 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Sending { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
> 2013-09-25 14:35:53,695 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Executing: { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
> 2013-09-25 14:35:53,702 INFO  [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) Executing resource StopCommand: {"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}
> 2013-09-25 14:35:53,703 DEBUG [vmware.mo.HostMO] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host
> 2013-09-25 14:35:54,753 INFO  [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) VM i-2-262-acs-docs-fc17 is already in stopped state
> 2013-09-25 14:37:50,264 DEBUG [agent.transport.Request] (DirectAgent-3:null) Seq 11-1418264581: Processing: { Ans: , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100, [{"StopAnswer":{"vncPort":0,"result":true,"details":"VM i-2-262-acs-docs-fc17 is already in stopped state","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-278-demo01t-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-281-acs-appliance Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-283-vmbld01l-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-285-ossec01l-ops-08 Succeed","wait":0}}] }
> 2013-09-25 14:38:14,528 DEBUG [vmware.mo.HostMO] (DirectAgent-20:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host
> If for whatever reason the agentState is Stopped for a vSphere VM, we should not force a power-off of the VM, nor do we need to restart it.
> The code that causes the restart is in engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java, lines 2365-2376.
> In this snippet we check whether cs state == Running and agentState == Stopped.
> We should not power off the VM if we cannot retrieve the state!
> 2233     protected Command compareState(long hostId, VMInstanceVO vm, final AgentVmInfo info, final boolean fullSync, boolean trackExternalChange) {
> 2234         State agentState = info.state;
> 2235         final State serverState = vm.getState();
> 2236         final String serverName = vm.getInstanceName();
> 2237
> 2238         Command command = null;
> 2239         s_logger.debug("VM " + serverName + ": cs state = " + serverState + " and realState = " + agentState);
> 2240         if (s_logger.isDebugEnabled()) {
> 2241             s_logger.debug("VM " + serverName + ": cs state = " + serverState + " and realState = " + agentState);
> 2242         }
> 2243
> 2244         if (agentState == State.Error) {
> 2245             agentState = State.Stopped;
> 2246
> 2247             short alertType = AlertManager.ALERT_TYPE_USERVM;
> 2248             if (VirtualMachine.Type.DomainRouter.equals(vm.getType())) {
> 2249                 alertType = AlertManager.ALERT_TYPE_DOMAIN_ROUTER;
> 2250             } else if (VirtualMachine.Type.ConsoleProxy.equals(vm.getType())) {
> 2251                 alertType = AlertManager.ALERT_TYPE_CONSOLE_PROXY;
> 2252             } else if (VirtualMachine.Type.SecondaryStorageVm.equals(vm.getType())) {
> 2253                 alertType = AlertManager.ALERT_TYPE_SSVM;
> 2254             }
> 2255
> 2256             HostPodVO podVO = _podDao.findById(vm.getPodIdToDeployIn());
> 2257             DataCenterVO dcVO = _dcDao.findById(vm.getDataCenterId());
> 2258             HostVO hostVO = _hostDao.findById(vm.getHostId());
> 2259
> 2260             String hostDesc = "name: " + hostVO.getName() + " (id:" + hostVO.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName();
> 2261             _alertMgr.sendAlert(alertType, vm.getDataCenterId(), vm.getPodIdToDeployIn(), "VM (name: " + vm.getInstanceName() + ", id: " + vm.getId() + ") stopped on host " +
> 2262                     hostDesc + " due to storage failure",
> 2263                     "Virtual Machine " + vm.getInstanceName() + " (id: " + vm.getId() + ") running on host [" + vm.getHostId() + "] stopped due to storage failure.");
> 2264         }
> 2265
> 2266         if (trackExternalChange) {
> 2267             if (serverState == State.Starting) {
> 2268                 if (vm.getHostId() != null && vm.getHostId() != hostId) {
> 2269                     s_logger.info("CloudStack is starting VM on host " + vm.getHostId() + ", but status report comes from a different host " + hostId +
> 2270                             ", skip status sync for vm: " + vm.getInstanceName());
> 2271                     return null;
> 2272                 }
> 2273             }
> 2274             if (vm.getHostId() == null || hostId != vm.getHostId()) {
> 2275                 try {
> 2276                     ItWorkVO workItem = _workDao.findByOutstandingWork(vm.getId(), State.Migrating);
> 2277                     if (workItem == null) {
> 2278                         stateTransitTo(vm, VirtualMachine.Event.AgentReportMigrated, hostId);
> 2279                     }
> 2280                 } catch (NoTransitionException e) {
> 2281                 }
> 2282             }
> 2283         }
> 2284
> 2285         // during VM migration time, don't sync state will agent status update
> 2286         if (serverState == State.Migrating) {
> 2287             s_logger.debug("Skipping vm in migrating state: " + vm);
> 2288             return null;
> 2289         }
> 2290
> 2291         if (trackExternalChange) {
> 2292             if (serverState == State.Starting) {
> 2293                 if (vm.getHostId() != null && vm.getHostId() != hostId) {
> 2294                     s_logger.info("CloudStack is starting VM on host " + vm.getHostId() + ", but status report comes from a different host " + hostId +
> 2295                             ", skip status sync for vm: " + vm.getInstanceName());
> 2296                     return null;
> 2297                 }
> 2298             }
> 2299
> 2300             if (serverState == State.Running) {
> 2301                 try {
> 2302                     //
> 2303                     // we had a bug that sometimes VM may be at Running State
> 2304                     // but host_id is null, we will cover it here.
> 2305                     // means that when CloudStack DB lost of host information,
> 2306                     // we will heal it with the info reported from host
> 2307                     //
> 2308                     if (vm.getHostId() == null || hostId != vm.getHostId()) {
> 2309                         if (s_logger.isDebugEnabled()) {
> 2310                             s_logger.debug("detected host change when VM " + vm + " is at running state, VM could be live-migrated externally from host " + vm.getHostId() +
> 2311                                     " to host " + hostId);
> 2312                         }
> 2313
> 2314                         stateTransitTo(vm, VirtualMachine.Event.AgentReportMigrated, hostId);
> 2315                     }
> 2316                 } catch (NoTransitionException e) {
> 2317                     s_logger.warn(e.getMessage());
> 2318                 }
> 2319             }
> 2320         }
> 2321
> 2322         if (agentState == serverState) {
> 2323             if (s_logger.isDebugEnabled()) {
> 2324                 s_logger.debug("Both states are " + agentState + " for " + vm);
> 2325             }
> 2326             assert (agentState == State.Stopped || agentState == State.Running) : "If the states we send up is changed, this must be changed.";
> 2327             if (agentState == State.Running) {
> 2328                 try {
> 2329                     stateTransitTo(vm, VirtualMachine.Event.AgentReportRunning, hostId);
> 2330                 } catch (NoTransitionException e) {
> 2331                     s_logger.warn(e.getMessage());
> 2332                 }
> 2333                 // FIXME: What if someone comes in and sets it to stopping? Then
> 2334                 // what?
> 2335                 return null;
> 2336             }
> 2337
> 2338             s_logger.debug("State matches but the agent said stopped so let's send a cleanup command anyways.");
> 2339             return cleanup(vm);
> 2340         }
> 2341
> 2342         if (agentState == State.Shutdowned) {
> 2343             if (serverState == State.Running || serverState == State.Starting || serverState == State.Stopping) {
> 2344                 try {
> 2345                     advanceStop(vm, true);
> 2346                 } catch (AgentUnavailableException e) {
> 2347                     assert (false) : "How do we hit this with forced on?";
> 2348                     return null;
> 2349                 } catch (OperationTimedoutException e) {
> 2350                     assert (false) : "How do we hit this with forced on?";
> 2351                     return null;
> 2352                 } catch (ConcurrentOperationException e) {
> 2353                     assert (false) : "How do we hit this with forced on?";
> 2354                     return null;
> 2355                 }
> 2356             } else {
> 2357                 s_logger.debug("Sending cleanup to a shutdowned vm: " + vm.getInstanceName());
> 2358                 command = cleanup(vm);
> 2359             }
> 2360         } else if (agentState == State.Stopped) {
> 2361             // This state means the VM on the agent was detected previously
> 2362             // and now is gone. This is slightly different than if the VM
> 2363             // was never completed but we still send down a Stop Command
> 2364             // to ensure there's cleanup.
> 2365             if (serverState == State.Running) {
> 2366                 // Our records showed that it should be running so let's restart
> 2367                 // it.
> 2368                 _haMgr.scheduleRestart(vm, false);
> 2369             } else if (serverState == State.Stopping) {
> 2370                 _haMgr.scheduleStop(vm, hostId, WorkType.ForceStop);
> 2371                 s_logger.debug("Scheduling a check stop for VM in stopping mode: " + vm);
> 2372             } else if (serverState == State.Starting) {
> 2373                 s_logger.debug("Ignoring VM in starting mode: " + vm.getInstanceName());
> 2374                 _haMgr.scheduleRestart(vm, false);
> 2375             }
> 2376             command = cleanup(vm);
>
> I'd be happy to test the fix if one comes out.
> Thanks
> ilya

--
This message was sent by Atlassian JIRA
(v6.1#6144)