Date: Wed, 17 Jan 2018 16:25:00 +0000 (UTC)
From: "Nux (JIRA)"
To: cloudstack-issues@incubator.apache.org
Reply-To: dev@cloudstack.apache.org
Subject: [jira] [Closed] (CLOUDSTACK-10234) HA fails in cases of PSU failure.

     [ https://issues.apache.org/jira/browse/CLOUDSTACK-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nux closed CLOUDSTACK-10234.
----------------------------
    Resolution: Later

This issue needs a more thorough rethink as it could lead to data corruption. For now Host HA only works as long as the IPMIs are within reach.

> HA fails in cases of PSU failure.
> ---------------------------------
>
>                 Key: CLOUDSTACK-10234
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10234
>             Project: CloudStack
>          Issue Type: Improvement
>      Security Level: Public(Anyone can view this level - this is the default.)
>          Components: Management Server
>    Affects Versions: 4.11.0.0
>         Environment: 4.11 RC1, NFS storage, CentOS 7 management server and hypervisors
>            Reporter: Nux
>            Assignee: Rohit Yadav
>            Priority: Major
>              Labels: HA, KVM
>
> To simulate a PSU failure I physically pulled the power from the server. HA fails to do the right thing and move the affected VMs to other HVs.
> I waited a good while, but alas nothing happened. The VM and VR running on the affected hypervisor were never moved to another one (I have another 2 running).
> Is there any way to at least force the system to mark that HV as bad/offline?
> This is what I see in the management server logs:
> {code:java}
> Caused by: com.cloud.utils.exception.CloudRuntimeException: Out-of-band Management action (OFF) on host (57bf86e0-e1cd-484e-a4f1-78b3ca2da125) failed with error: Get Auth Capabilities error Error issuing Get Channel Authentication Capabilities request Error: Unable to establish IPMI v2 / RMCP+ session
>     at org.apache.cloudstack.outofbandmanagement.OutOfBandManagementServiceImpl.executePowerOperation(OutOfBandManagementServiceImpl.java:423)
>     at sun.reflect.GeneratedMethodAccessor199.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     ... 21 more
> 2018-01-16 17:00:13,396 WARN  [o.a.c.alerts] (pool-5-thread-7:null) (logid:4f7299f6) AlertType:: 30 | dataCenterId:: 1 | podId:: 1 | clusterId:: null | message:: HA Fencing of host id=1, in dc id=1 performed
> 2018-01-16 17:00:15,375 DEBUG [c.c.a.t.Request] (pool-2-thread-27:null) (logid:6b21a8c1) Seq 5-9115285645797884785: Sending  \{ Cmd , MgmtId: 161334379813, via: 5(hv03.cloud.local), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"598d48ef-158d-3e14-ad68-8d02c9368ddf-LibvirtComputingResource","privateNetwork":{"ip":"172.16.25.101","netmask":"255.255.255.240","mac":"0c:c4:7a:40:8e:f6","isSecurityGroupEnabled":false},"publicNetwork":\{"ip":"172.16.25.101","netmask":"255.255.255.240","mac":"0c:c4:7a:40:8e:f6","isSecurityGroupEnabled":false},"storageNetwork1":\{"ip":"172.16.25.101","netmask":"255.255.255.240","mac":"0c:c4:7a:40:8e:f6","isSecurityGroupEnabled":false}},"wait":20}}] }
> 2018-01-16 17:00:15,380 DEBUG [c.c.a.t.Request] (pool-2-thread-5:null) (logid:bb993597) Seq 4-6582855280332112812: Sending  \{ Cmd , MgmtId: 161334379813, via: 4(hv02.cloud.local), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"6ebb3010-9c49-3a9c-b620-ecbc9731aca2-LibvirtComputingResource","privateNetwork":{"ip":"172.16.25.100","netmask":"255.255.255.240","mac":"0c:c4:7a:40:8e:8e","isSecurityGroupEnabled":false},"publicNetwork":\{"ip":"172.16.25.100","netmask":"255.255.255.240","mac":"0c:c4:7a:40:8e:8e","isSecurityGroupEnabled":false},"storageNetwork1":\{"ip":"172.16.25.100","netmask":"255.255.255.240","mac":"0c:c4:7a:40:8e:8e","isSecurityGroupEnabled":false}},"wait":20}}] }
> 2018-01-16 17:00:15,423 DEBUG [c.c.a.t.Request] (AgentManager-Handler-4:null) (logid:) Seq 5-9115285645797884785: Processing:  \{ Ans: , MgmtId: 161334379813, via: 5, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"Heart is beating...","wait":0}}] }
> 2018-01-16 17:00:15,423 DEBUG [c.c.a.t.Request] (pool-2-thread-27:null) (logid:6b21a8c1) Seq 5-9115285645797884785: Received:  \{ Ans: , MgmtId: 161334379813, via: 5(hv03.cloud.local), Ver: v1, Flags: 10, { Answer } }
> 2018-01-16 17:00:15,423 DEBUG [c.c.a.m.AgentManagerImpl] (pool-2-thread-27:null) (logid:6b21a8c1) Details from executing class com.cloud.agent.api.CheckOnHostCommand: Heart is beating...
> 2018-01-16 17:00:15,427 DEBUG [c.c.a.t.Request] (AgentManager-Handler-6:null) (logid:) Seq 4-6582855280332112812: Processing:  \{ Ans: , MgmtId: 161334379813, via: 4, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"Heart is beating...","wait":0}}] }
> 2018-01-16 17:00:15,427 DEBUG [c.c.a.t.Request] (pool-2-thread-5:null) (logid:bb993597) Seq 4-6582855280332112812: Received:  \{ Ans: , MgmtId: 161334379813, via: 4(hv02.cloud.local), Ver: v1, Flags: 10, { Answer } }
> 2018-01-16 17:00:15,427 DEBUG [c.c.a.m.AgentManagerImpl] (pool-2-thread-5:null) (logid:bb993597) Details from executing class com.cloud.agent.api.CheckOnHostCommand: Heart is beating...
> 2018-01-16 17:00:16,217 INFO  [o.a.c.f.j.i.AsyncJobManagerImpl] (AsyncJobMgr-Heartbeat-1:ctx-d9c2c841) (logid:1b093681) Begin cleanup expired async-jobs
> 2018-01-16 17:00:16,218 INFO  [o.a.c.f.j.i.AsyncJobManagerImpl] (AsyncJobMgr-Heartbeat-1:ctx-d9c2c841) (logid:1b093681) End cleanup expired async-jobs
> 2018-01-16 17:00:17,392 WARN  [o.a.c.o.PowerOperationTask] (pool-6-thread-29:null) (logid:f9788c38) Out-of-band management background task operation=STATUS for host id=1 failed with: Out-of-band Management action (STATUS) on host (57bf86e0-e1cd-484e-a4f1-78b3ca2da125) failed with error: Get Auth Capabilities error Error issuing Get Channel Authentication Capabilities request Error: Unable to establish IPMI v2 / RMCP+ session
> 2018-01-16 17:00:17,422 DEBUG [o.a.c.o.OutOfBandManagementServiceImpl] (pool-5-thread-6:ctx-65225bcc) (logid:665de20f) Out-of-band Management action (OFF) on host (57bf86e0-e1cd-484e-a4f1-78b3ca2da125) failed with error: Get Auth Capabilities error Error issuing Get Channel Authentication Capabilities request Error: Unable to establish IPMI v2 / RMCP+ session
> 2018-01-16 17:00:17,438 WARN  [o.a.c.k.h.KVMHAProvider] (pool-5-thread-6:ctx-65225bcc) (logid:665de20f) OOBM service is not configured or enabled for this host hv01.cloud.local error is Out-of-band Management action (OFF) on host (57bf86e0-e1cd-484e-a4f1-78b3ca2da125) failed with error: Get Auth Capabilities error Error issuing Get Channel Authentication Capabilities request Error: Unable to establish IPMI v2 / RMCP+ session
> 2018-01-16 17:00:17,438 WARN  [o.a.c.h.t.BaseHATask] (pool-5-thread-9:null) (logid:ff44841a) Exception occurred while running FenceTask on a resource: org.apache.cloudstack.ha.provider.HAFenceException: OOBM service is not configured or enabled for this host hv01.cloud.local
> org.apache.cloudstack.ha.provider.HAFenceException: OOBM service is not configured or enabled for this host hv01.cloud.local
>     at org.apache.cloudstack.kvm.ha.KVMHAProvider.fence(KVMHAProvider.java:99)
>     at org.apache.cloudstack.kvm.ha.KVMHAProvider.fence(KVMHAProvider.java:42)
>     at org.apache.cloudstack.ha.task.FenceTask.performAction(FenceTask.java:42)
>     at org.apache.cloudstack.ha.task.BaseHATask$1.call(BaseHATask.java:86)
>     at org.apache.cloudstack.ha.task.BaseHATask$1.call(BaseHATask.java:83)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: com.cloud.utils.exception.CloudRuntimeException: Out-of-band Management action (OFF) on host (57bf86e0-e1cd-484e-a4f1-78b3ca2da125) failed with error: Get Auth Capabilities error Error issuing Get Channel Authentication Capabilities request Error: Unable to establish IPMI v2 / RMCP+ session
>     at org.apache.cloudstack.outofbandmanagement.OutOfBandManagementServiceImpl.executePowerOperation(OutOfBandManagementServiceImpl.java:423)
>     at sun.reflect.GeneratedMethodAccessor199.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     ... 21 more
> 2018-01-16 17:00:17,439 WARN  [o.a.c.alerts] (pool-5-thread-9:null) (logid:ff44841a) AlertType:: 30 | dataCenterId:: 1 | podId:: 1 | clusterId:: null | message:: HA Fencing of host id=1, in dc id=1 performed
> 2018-01-16 17:00:17,903 DEBUG [o.a.c.s.SecondaryStorageManagerImpl] (secstorage-1:ctx-ccb33721) (logid:722404aa) Zone 1 is ready to launch secondary storage VM
> 2018-01-16 17:00:17,935 DEBUG [c.c.c.ConsoleProxyManagerImpl] (consoleproxy-1:ctx-22a69a02) (logid:393fab21) Zone 1 is ready to launch console proxy
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)