Date: Wed, 9 Oct 2013 10:57:41 +0000 (UTC)
From: "Abhinandan Prateek (JIRA)"
To: cloudstack-issues@incubator.apache.org
Subject: [jira] [Updated] (CLOUDSTACK-4740) Some vSphere VMs are shutdown when ACS is restarted

     [ https://issues.apache.org/jira/browse/CLOUDSTACK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhinandan Prateek updated CLOUDSTACK-4740:
-------------------------------------------
    Fix Version/s:     (was: 4.2.1)
                       4.3.0

> Some vSphere VMs are shutdown when ACS is restarted
> ---------------------------------------------------
>
>                 Key: CLOUDSTACK-4740
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-4740
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public (Anyone can view this level - this is the default.)
>          Components: Management Server
>    Affects Versions: 4.1.0, 4.1.1, 4.2.0, Future
>        Environment: I'm running ACS 4.1.1 with vSphere 5.1
>            Reporter: ilya musayev
>            Priority: Critical
>             Labels: management, poweroff
>            Fix For: 4.3.0
>
>
> If the management server is restarted, then when it starts back up it checks the agentState of each vSphere VM, and if it does not get a proper response it marks the VM as stopped.
> As a result, some of my virtual instances were shut down.
> In analyzing this issue further, here are my findings and the errors seen in the log (excerpted after the sketch below).
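> A quick toy model of the failure mode, before the raw logs (illustrative only: FullSyncToy and everything in it are made-up names, not CloudStack code; the real logic is in compareState(), quoted further down):
>
>     import java.util.HashMap;
>     import java.util.Map;
>
>     public class FullSyncToy {
>         enum State { Running, Stopped }
>
>         public static void main(String[] args) {
>             // The CloudStack DB still says Running. The host report built
>             // during the post-restart full sync says Stopped, and the report
>             // cannot distinguish "really powered off" from "power state could
>             // not be read", so both cases look identical to the sync loop.
>             State dbState = State.Running;
>             Map<String, State> hostReport = new HashMap<>();
>             hostReport.put("i-2-262-acs-docs-fc17", State.Stopped);
>
>             if (dbState == State.Running
>                     && hostReport.get("i-2-262-acs-docs-fc17") == State.Stopped) {
>                 // This is the point where CloudStack sends a StopCommand and
>                 // schedules an HA restart, even if the report was wrong.
>                 System.out.println("StopCommand -> i-2-262-acs-docs-fc17");
>             }
>         }
>     }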
> 2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource] (AgentTaskPool-1:null) Detecting a new state but couldn't find a old state so adding it to the changes: i-2-262-acs-docs-fc17
> 2013-09-25 14:35:51,213 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq -1--1: Startup request from directly connected host: { Cmd , MgmtId: -1, via: -1, Ver: v1, Flags: 11, [{"cpus":16,"speed":2199,"memory":68683468800,"dom0MinMemory":0,"poolSync":false,"vms":{"i-8-270-CLOUD411":{"state":"Running"},"r-15-CLOUD41-OLD":{"state":"Stopped"},"v-260-CLOUD411":{"state":"Running"},"i-2-283-vmbld01l-ops-08":{"state":"Running"},"i-2-104-ACS41VM":{"state":"Running"},"--s-1-CLOUD41-OLD":{"state":"Running"},"i-27-280-CLOUD411":{"state":"Running"},"i-2-285-ossec01l-ops-08":{"state":"Running"},"i-2-262-acs-docs-fc17":{"state":"Stopped"},"i-24-265-test3":{"state":"Running"},"cloud01l-ops-08.portal.webmd.com":{"state":"Running"},"i-2-278-demo01t-ops-08":{"state":"Running"},"s-63-CLOUD411":{"state":"Running"},"r-66-CLOUD411":{"state":"Running"},"i-2-281-acs-appliance":{"state":"Running"}},"caps":"hvm","hypervisorType":"VMware","hostDetails":{"com.cloud.network.Networks.RouterPrivateIpStrategy":"DcGlobal","NativeHA":"true"},"hypervisorVersion":"5.0","type":"Routing","dataCenter":"2","pod":"2","cluster":"3","guid":"HostSystem:host-19716@vc00q-ops-08.portal.webmd.com","name":"vmha62d-ops-08.portal.webmd.com","version":"4.1.1-SNAPSHOT","privateIpAddress":"172.25.243.31","privateMacAddress":"68:b5:99:73:0b:c2","privateNetmask":"255.255.255.0","storageIpAddress":"172.25.243.31","storageNetmask":"255.255.255.0","storageMacAddress":"68:b5:99:73:0b:c2","wait":0},{"totalSize":0,"poolInfo":{"uuid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","host":"vmha62d-ops-08.portal.webmd.com","localPath":"","hostPath":"datastore-19718","poolType":"LVM","capacityBytes":141465485312,"availableBytes":140383354880},"resourceType":"STORAGE_POOL","hostDetails":{},"type":"Storage","dataCenter":"2","pod":"2","cluster":"3","guid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","name":"72c8aedb-58c4-4569-ac51-adc5af770bf6","wait":0}] }
> 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
> 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
> 2013-09-25 14:35:53,614 INFO  [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) Skip HA for VMware VM i-2-262-acs-docs-fc17
> 2013-09-25 14:35:53,694 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Sending { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
> 2013-09-25 14:35:53,695 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Executing: { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
> 2013-09-25 14:35:53,702 INFO  [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) Executing resource StopCommand: {"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}
> 2013-09-25 14:35:53,703 DEBUG [vmware.mo.HostMO] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host
> 2013-09-25 14:35:54,753 INFO  [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) VM i-2-262-acs-docs-fc17 is already in stopped state
> 2013-09-25 14:37:50,264 DEBUG [agent.transport.Request] (DirectAgent-3:null) Seq 11-1418264581: Processing: { Ans: , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100, [{"StopAnswer":{"vncPort":0,"result":true,"details":"VM i-2-262-acs-docs-fc17 is already in stopped state","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-278-demo01t-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-281-acs-appliance Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-283-vmbld01l-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-285-ossec01l-ops-08 Succeed","wait":0}}] }
> 2013-09-25 14:38:14,528 DEBUG [vmware.mo.HostMO] (DirectAgent-20:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host
> If for whatever reason the agentState is Stopped for a vSphere VM, we should not force a power-off of the VM, nor do we need to restart it.
> The code that causes the restart is in engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java, lines 2365-2376.
> In this snippet we check whether cs state == Running and agentState == Stopped.
> We should not power off the VM if we cannot retrieve the state!
> 2233     protected Command compareState(long hostId, VMInstanceVO vm, final AgentVmInfo info, final boolean fullSync, boolean trackExternalChange) {
> 2234         State agentState = info.state;
> 2235         final State serverState = vm.getState();
> 2236         final String serverName = vm.getInstanceName();
> 2237
> 2238         Command command = null;
> 2239         s_logger.debug("VM " + serverName + ": cs state = " + serverState + " and realState = " + agentState);
> 2240         if (s_logger.isDebugEnabled()) {
> 2241             s_logger.debug("VM " + serverName + ": cs state = " + serverState + " and realState = " + agentState);
> 2242         }
> 2243
> 2244         if (agentState == State.Error) {
> 2245             agentState = State.Stopped;
> 2246
> 2247             short alertType = AlertManager.ALERT_TYPE_USERVM;
> 2248             if (VirtualMachine.Type.DomainRouter.equals(vm.getType())) {
> 2249                 alertType = AlertManager.ALERT_TYPE_DOMAIN_ROUTER;
> 2250             } else if (VirtualMachine.Type.ConsoleProxy.equals(vm.getType())) {
> 2251                 alertType = AlertManager.ALERT_TYPE_CONSOLE_PROXY;
> 2252             } else if (VirtualMachine.Type.SecondaryStorageVm.equals(vm.getType())) {
> 2253                 alertType = AlertManager.ALERT_TYPE_SSVM;
> 2254             }
> 2255
> 2256             HostPodVO podVO = _podDao.findById(vm.getPodIdToDeployIn());
> 2257             DataCenterVO dcVO = _dcDao.findById(vm.getDataCenterId());
> 2258             HostVO hostVO = _hostDao.findById(vm.getHostId());
> 2259
> 2260             String hostDesc = "name: " + hostVO.getName() + " (id:" + hostVO.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName();
> 2261             _alertMgr.sendAlert(alertType, vm.getDataCenterId(), vm.getPodIdToDeployIn(), "VM (name: " + vm.getInstanceName() + ", id: " + vm.getId() + ") stopped on host " +
> 2262                     hostDesc + " due to storage failure",
> 2263                     "Virtual Machine " + vm.getInstanceName() + " (id: " + vm.getId() + ") running on host [" + vm.getHostId() + "] stopped due to storage failure.");
> 2264         }
> 2265
> 2266         if (trackExternalChange) {
> 2267             if (serverState == State.Starting) {
> 2268                 if (vm.getHostId() != null && vm.getHostId() != hostId) {
> 2269                     s_logger.info("CloudStack is starting VM on host " + vm.getHostId() + ", but status report comes from a different host " + hostId +
> 2270                             ", skip status sync for vm: " + vm.getInstanceName());
> 2271                     return null;
> 2272                 }
> 2273             }
> 2274             if (vm.getHostId() == null || hostId != vm.getHostId()) {
> 2275                 try {
> 2276                     ItWorkVO workItem = _workDao.findByOutstandingWork(vm.getId(), State.Migrating);
> 2277                     if (workItem == null) {
> 2278                         stateTransitTo(vm, VirtualMachine.Event.AgentReportMigrated, hostId);
> 2279                     }
> 2280                 } catch (NoTransitionException e) {
> 2281                 }
> 2282             }
> 2283         }
> 2284
> 2285         // during VM migration time, don't sync state will agent status update
> 2286         if (serverState == State.Migrating) {
> 2287             s_logger.debug("Skipping vm in migrating state: " + vm);
> 2288             return null;
> 2289         }
> 2290
> 2291         if (trackExternalChange) {
> 2292             if (serverState == State.Starting) {
> 2293                 if (vm.getHostId() != null && vm.getHostId() != hostId) {
> 2294                     s_logger.info("CloudStack is starting VM on host " + vm.getHostId() + ", but status report comes from a different host " + hostId +
> 2295                             ", skip status sync for vm: " + vm.getInstanceName());
> 2296                     return null;
> 2297                 }
> 2298             }
> 2299
> 2300             if (serverState == State.Running) {
> 2301                 try {
> 2302                     //
> 2303                     // we had a bug that sometimes VM may be at Running State
> 2304                     // but host_id is null, we will cover it here.
> 2305                     // means that when CloudStack DB lost of host information,
> 2306                     // we will heal it with the info reported from host
> 2307                     //
> 2308                     if (vm.getHostId() == null || hostId != vm.getHostId()) {
> 2309                         if (s_logger.isDebugEnabled()) {
> 2310                             s_logger.debug("detected host change when VM " + vm + " is at running state, VM could be live-migrated externally from host " + vm.getHostId() +
> 2311                                     " to host " + hostId);
> 2312                         }
> 2313
> 2314                         stateTransitTo(vm, VirtualMachine.Event.AgentReportMigrated, hostId);
> 2315                     }
> 2316                 } catch (NoTransitionException e) {
> 2317                     s_logger.warn(e.getMessage());
> 2318                 }
> 2319             }
> 2320         }
> 2321
> 2322         if (agentState == serverState) {
> 2323             if (s_logger.isDebugEnabled()) {
> 2324                 s_logger.debug("Both states are " + agentState + " for " + vm);
> 2325             }
> 2326             assert (agentState == State.Stopped || agentState == State.Running) : "If the states we send up is changed, this must be changed.";
> 2327             if (agentState == State.Running) {
> 2328                 try {
> 2329                     stateTransitTo(vm, VirtualMachine.Event.AgentReportRunning, hostId);
> 2330                 } catch (NoTransitionException e) {
> 2331                     s_logger.warn(e.getMessage());
> 2332                 }
> 2333                 // FIXME: What if someone comes in and sets it to stopping? Then
> 2334                 // what?
> 2335                 return null;
> 2336             }
> 2337
> 2338             s_logger.debug("State matches but the agent said stopped so let's send a cleanup command anyways.");
> 2339             return cleanup(vm);
> 2340         }
> 2341
> 2342         if (agentState == State.Shutdowned) {
> 2343             if (serverState == State.Running || serverState == State.Starting || serverState == State.Stopping) {
> 2344                 try {
> 2345                     advanceStop(vm, true);
> 2346                 } catch (AgentUnavailableException e) {
> 2347                     assert (false) : "How do we hit this with forced on?";
> 2348                     return null;
> 2349                 } catch (OperationTimedoutException e) {
> 2350                     assert (false) : "How do we hit this with forced on?";
> 2351                     return null;
> 2352                 } catch (ConcurrentOperationException e) {
> 2353                     assert (false) : "How do we hit this with forced on?";
> 2354                     return null;
> 2355                 }
> 2356             } else {
> 2357                 s_logger.debug("Sending cleanup to a shutdowned vm: " + vm.getInstanceName());
> 2358                 command = cleanup(vm);
> 2359             }
> 2360         } else if (agentState == State.Stopped) {
> 2361             // This state means the VM on the agent was detected previously
> 2362             // and now is gone. This is slightly different than if the VM
> 2363             // was never completed but we still send down a Stop Command
> 2364             // to ensure there's cleanup.
> 2365             if (serverState == State.Running) {
> 2366                 // Our records showed that it should be running so let's restart
> 2367                 // it.
> 2368                 _haMgr.scheduleRestart(vm, false);
> 2369             } else if (serverState == State.Stopping) {
> 2370                 _haMgr.scheduleStop(vm, hostId, WorkType.ForceStop);
> 2371                 s_logger.debug("Scheduling a check stop for VM in stopping mode: " + vm);
> 2372             } else if (serverState == State.Starting) {
> 2373                 s_logger.debug("Ignoring VM in starting mode: " + vm.getInstanceName());
> 2374                 _haMgr.scheduleRestart(vm, false);
> 2375             }
> 2376             command = cleanup(vm);
>
> I'd be happy to test the fix if one comes out.
> Thanks
> ilya

--
This message was sent by Atlassian JIRA
(v6.1#6144)