Return-Path: X-Original-To: apmail-cloudstack-dev-archive@www.apache.org Delivered-To: apmail-cloudstack-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D827A10AEB for ; Wed, 25 Sep 2013 22:14:01 +0000 (UTC) Received: (qmail 73711 invoked by uid 500); 25 Sep 2013 22:14:01 -0000 Delivered-To: apmail-cloudstack-dev-archive@cloudstack.apache.org Received: (qmail 73678 invoked by uid 500); 25 Sep 2013 22:14:01 -0000 Mailing-List: contact dev-help@cloudstack.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@cloudstack.apache.org Delivered-To: mailing list dev@cloudstack.apache.org Received: (qmail 73670 invoked by uid 99); 25 Sep 2013 22:14:00 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 25 Sep 2013 22:14:00 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of imusayev@webmd.net designates 216.32.180.185 as permitted sender) Received: from [216.32.180.185] (HELO co1outboundpool.messaging.microsoft.com) (216.32.180.185) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 25 Sep 2013 22:13:53 +0000 Received: from mail214-co1-R.bigfish.com (10.243.78.230) by CO1EHSOBE010.bigfish.com (10.243.66.73) with Microsoft SMTP Server id 14.1.225.22; Wed, 25 Sep 2013 22:13:32 +0000 Received: from mail214-co1 (localhost [127.0.0.1]) by mail214-co1-R.bigfish.com (Postfix) with ESMTP id 07B97D00103 for ; Wed, 25 Sep 2013 22:13:32 +0000 (UTC) X-Forefront-Antispam-Report: CIP:67.130.38.18;KIP:(null);UIP:(null);IPV:NLI;H:EXHT02L-CRP-06.webmdhealth.net;RD:none;EFVD:NLI X-SpamScore: -6 X-BigFish: VPS-6(zz9371I1527I936eI542I1432I14ffIzz1f42h208ch1ee6h1de0h1fdah2073h1202h1e76h1d1ah1d2ah1fc6hzz1de098h1de097h186068h8275bh8275dhz31h2a8h839h944hd25hf0ah1220h1288h12a5h12a9h12bdh137ah13b6h1441h14ddh1504h1537h153bh15d0h162dh1631h1758h18e1h1946h19b5h1ad9h1b0ah1b2fh1fb3h1d0ch1d2eh1d3fh1dfeh1dffh1e1dh1fe8h1ff5h1155h) Received-SPF: softfail (mail214-co1: transitioning domain of webmd.net does not designate 67.130.38.18 as permitted sender) client-ip=67.130.38.18; envelope-from=imusayev@webmd.net; helo=EXHT02L-CRP-06.webmdhealth.net ;mdhealth.net ; Received: from mail214-co1 (localhost.localdomain [127.0.0.1]) by mail214-co1 (MessageSwitch) id 138014721024580_32586; Wed, 25 Sep 2013 22:13:30 +0000 (UTC) Received: from CO1EHSMHS024.bigfish.com (unknown [10.243.78.236]) by mail214-co1.bigfish.com (Postfix) with ESMTP id 022F6CC0041 for ; Wed, 25 Sep 2013 22:13:30 +0000 (UTC) Received: from EXHT02L-CRP-06.webmdhealth.net (67.130.38.18) by CO1EHSMHS024.bigfish.com (10.243.66.34) with Microsoft SMTP Server (TLS) id 14.16.227.3; Wed, 25 Sep 2013 22:13:29 +0000 Received: from EXDFO01L-CRP-06.webmdhealth.net ([fe80::9dd9:c35b:dd4e:f7cf]) by EXHT02L-CRP-06.webmdhealth.net ([::1]) with mapi id 14.03.0123.003; Wed, 25 Sep 2013 18:13:27 -0400 From: "Musayev, Ilya" To: "dev@cloudstack.apache.org" Subject: RE: [MAJOR][BUG] ACS powers off some VMs in vSphere - when MS service is restarted Thread-Topic: [MAJOR][BUG] ACS powers off some VMs in vSphere - when MS service is restarted Thread-Index: Ac66O37TOonso4bfSbWKFjNwyC4AFAAAJvxg Date: Wed, 25 Sep 2013 22:13:26 +0000 Message-ID: References: In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.46.40.104] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-OriginatorOrg: webmd.net X-FOPE-CONNECTOR: Id%0$Dn%*$RO%0$TLS%0$FQDN%$TlsDn% X-Virus-Checked: Checked by ClamAV on apache.org Sorry if the error log is abit hard to read, as an example, please track th= e vmname i-2-262-acs-docs-fc17. 2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource] (AgentTaskPo= ol-1:null) Detecting a new state but couldn't find a old state so adding it= to the changes: i-2-262-acs-docs-fc17 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTa= skPool-1:null) VM i-2-262-acs-docs-fc17: cs state =3D Running and realState= =3D Stopped 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTa= skPool-1:null) VM i-2-262-acs-docs-fc17: cs state =3D Running and realState= =3D Stopped 2013-09-25 14:35:53,614 INFO [cloud.ha.HighAvailabilityManagerImpl] (Agent= TaskPool-1:null) Skip HA for VMware VM i-2-262-acs-docs-fc17 2013-09-25 14:35:53,694 DEBUG [agent.transport.Request] (AgentTaskPool-1:nu= ll) Seq 11-1418264581: Sending { Cmd , MgmtId: 345049078181, via: 11, Ver:= v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-= docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-dem= o01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-a= cs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-= vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-2= 85-ossec01l-ops-08","wait":0}}] } 2013-09-25 14:35:53,695 DEBUG [agent.transport.Request] (AgentTaskPool-1:nu= ll) Seq 11-1418264581: Executing: { Cmd , MgmtId: 345049078181, via: 11, V= er: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-a= cs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-= demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-28= 1-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-2= 83-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-= 2-285-ossec01l-ops-08","wait":0}}] } 2013-09-25 14:35:53,702 INFO [vmware.resource.VmwareResource] (DirectAgent= -3:vmha62d-ops-08.portal.webmd.com) Executing resource StopCommand: {"isPro= xy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0} 2013-09-25 14:35:53,703 DEBUG [vmware.mo.HostMO] (DirectAgent-3:vmha62d-ops= -08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host 2013-09-25 14:35:54,753 INFO [vmware.resource.VmwareResource] (DirectAgent= -3:vmha62d-ops-08.portal.webmd.com) VM i-2-262-acs-docs-fc17 is already in = stopped state > -----Original Message----- > From: Musayev, Ilya [mailto:imusayev@webmd.net] > Sent: Wednesday, September 25, 2013 6:08 PM > To: dev@cloudstack.apache.org > Subject: [MAJOR][BUG] ACS powers off some VMs in vSphere - when MS > service is restarted >=20 > Hi All, >=20 > I'd like to raise an awareness on the issue with ACS 4.1.1 (I assume othe= r > versions are affected - since I could not find any changes in the latest = code > that would state otherwise). >=20 > On MS server start/restart, it checks all the VMs for their state, if som= e > reason state is either not found or comes as stopped - which is inaccurat= e, > the vms will *power off*. >=20 > On todays occurrence, half of my plant went down because ACS invoked > StopCommand on the vms that either had no state or for some reason > agentState =3D=3D Stopped. >=20 > The details of this issue is in > https://issues.apache.org/jira/browse/CLOUDSTACK-4740 >=20 > The error in the logs is: >=20 > 2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource] > (AgentTaskPool-1:null) Detecting a new state but couldn't find a old stat= e so > adding it to the changes: i-2-262-acs-docs-fc17 > 2013-09-25 14:35:51,213 DEBUG [agent.transport.Request] (AgentTaskPool- > 1:null) Seq -1--1: Startup request from directly connected host: { Cmd , > MgmtId: -1, via: -1, Ver: v1, Flags: 11, > [{"cpus":16,"speed":2199,"memory":68683468800,"dom0MinMemory":0,"po > olSync":false,"vms":{"i-8-270-CLOUD411":{"state":"Running"},"r-15- > CLOUD41-OLD":{"state":"Stopped"},"v-260- > CLOUD411":{"state":"Running"},"i-2-283-vmbld01l-ops- > 08":{"state":"Running"},"i-2-104-ACS41VM":{"state":"Running"},"--s-1- > CLOUD41-OLD":{"state":"Running"},"i-27-280- > CLOUD411":{"state":"Running"},"i-2-285-ossec01l-ops- > 08":{"state":"Running"},"i-2-262-acs-docs-fc17":{"state":"Stopped"},"i-24= - > 265-test3":{"state":"Running"},"cloud01l-ops- > 08.portal.webmd.com":{"state":"Running"},"i-2-278-demo01t-ops- > 08":{"state":"Running"},"s-63-CLOUD411":{"state":"Running"},"r-66- > CLOUD411":{"state":"Running"},"i-2-281-acs- > appliance":{"state":"Running"}},"caps":"hvm","hypervisorType":"VMware"," > hostDetails":{"com.cloud.network.Networks.RouterPrivateIpStrategy":"DcGl > obal","NativeHA":"true"},"hypervisorVersion":"5.0","type":"Routing","data= C > enter":"2","pod":"2","cluster":"3","guid":"HostSystem:host-19716@vc00q- > ops-08.portal.webmd.com","name":"vmha62d-ops- > 08.portal.webmd.com","version":"4.1.1- > SNAPSHOT","privateIpAddress":"172.25.243.31","privateMacAddress":"68:b5 > :99:73:0b:c2","privateNetmask":"255.255.255.0","storageIpAddress":"172.25= . > 243.31","storageNetmask":"255.255.255.0","storageMacAddress":"68:b5:99:7 > 3:0b:c2","wait":0},{"totalSize":0,"poolInfo":{"uuid":"72c8aedb-58c4-4569- > ac51-adc5af770bf6","host":"vmha62d-ops- > 08.portal.webmd.com","localPath":"","hostPath":"datastore- > 19718","poolType":"LVM","capacityBytes":141465485312,"availableBytes":14 > 0383354880},"resourceType":"STORAGE_POOL","hostDetails":{},"type":"Stor > age","dataCenter":"2","pod":"2","cluster":"3","guid":"72c8aedb-58c4-4569- > ac51-adc5af770bf6","name":"72c8aedb-58c4-4569-ac51- > adc5af770bf6","wait":0}] } > 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] > (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state =3D Running and > realState =3D Stopped > 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] > (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state =3D Running and > realState =3D Stopped > 2013-09-25 14:35:53,614 INFO [cloud.ha.HighAvailabilityManagerImpl] > (AgentTaskPool-1:null) Skip HA for VMware VM i-2-262-acs-docs-fc17 > 2013-09-25 14:35:53,694 DEBUG [agent.transport.Request] (AgentTaskPool- > 1:null) Seq 11-1418264581: Sending { Cmd , MgmtId: 345049078181, via: 11, > Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2- > 262-acs-docs- > fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278- > demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i- > 2-281-acs- > appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283- > vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i- > 2-285-ossec01l-ops-08","wait":0}}] } > 2013-09-25 14:35:53,695 DEBUG [agent.transport.Request] (AgentTaskPool- > 1:null) Seq 11-1418264581: Executing: { Cmd , MgmtId: 345049078181, via: = 11, > Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2- > 262-acs-docs- > fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278- > demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i- > 2-281-acs- > appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283- > vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i- > 2-285-ossec01l-ops-08","wait":0}}] } > 2013-09-25 14:35:53,702 INFO [vmware.resource.VmwareResource] > (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) Executing resource > StopCommand: {"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0} > 2013-09-25 14:35:53,703 DEBUG [vmware.mo.HostMO] (DirectAgent- > 3:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on > host > 2013-09-25 14:35:54,753 INFO [vmware.resource.VmwareResource] > (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) VM i-2-262-acs-docs- > fc17 is already in stopped state > 2013-09-25 14:37:50,264 DEBUG [agent.transport.Request] (DirectAgent- > 3:null) Seq 11-1418264581: Processing: { Ans: , MgmtId: 345049078181, via= : 11, > Ver: v1, Flags: 100, [{"StopAnswer":{"vncPort":0,"result":true,"details":= "VM > i-2-262-acs-docs-fc17 is already in stopped > state","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Sto= p > VM i-2-278-demo01t-ops-08 > Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"S= top > VM i-2-281-acs-appliance > Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"S= top > VM i-2-283-vmbld01l-ops-08 > Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"S= top > VM i-2-285-ossec01l-ops-08 Succeed","wait":0}}] } > 2013-09-25 14:38:14,528 DEBUG [vmware.mo.HostMO] (DirectAgent- > 20:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on > host >=20 > I wonder why this was not noted earlier. >=20 > The code in question is below, and comes from > engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java > on master branch. >=20 > 2365 if (serverState =3D=3D State.Running) { > 2366 // Our records showed that it should be running so let's restart > 2367 // it. > 2368 _haMgr.scheduleRestart(vm, false); > 2369 } else if (serverState =3D=3D State.Stopping) { > 2370 _haMgr.scheduleStop(vm, hostId, WorkType.ForceStop); > 2371 s_logger.debug("Scheduling a check stop for VM in stopping mode: " + > vm); > 2372 } else if (serverState =3D=3D State.Starting) { > 2373 s_logger.debug("Ignoring VM in starting mode: " + > vm.getInstanceName()); > 2374 _haMgr.scheduleRestart(vm, false); > 2375 } > 2376 command =3D cleanup(vm); >=20 > If there is a quick fix, which I think for now, I can just comment out th= e > "command=3D cleanup(vm)", please let me know. I'm not certain what would > be the repercussions. >=20 > As always any feedback, is appreciated. >=20 > Thanks > ilya >=20 >=20