Date: Fri, 2 Mar 2018 08:20:01 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)" <jira@apache.org>
To: cloudstack-issues@incubator.apache.org
Message-ID: <JIRA.13132723.1516630306000.314950.1519978801550@Atlassian.JIRA>
In-Reply-To: <JIRA.13132723.1516630306000@Atlassian.JIRA>
References: <JIRA.13132723.1516630306000@Atlassian.JIRA> <JIRA.13132723.1516630306142@jira-lw-us.apache.org>
Subject: [jira] [Commented] (CLOUDSTACK-10246) VM HA issues
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8


    [ https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383336#comment-16383336 ]

ASF GitHub Bot commented on CLOUDSTACK-10246:
---------------------------------------------

DaanHoogland commented on a change in pull request #2474: CLOUDSTACK-10246 Fix Host HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171784055

 ##########
 File path: engine/orchestration/src/com/cloud/agent/manager/AgentManagerImpl.java
 ##########
 @@ -843,72 +846,103 @@ protected boolean handleDisconnectWithInvestigation(final AgentAttache attache,
                 s_logger.debug("Caught exception while getting agent's next status", ne);
             }
 
+            // For log and alert purposes later
+            final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
+            final HostPodVO podVO = _podDao.findById(host.getPodId());
+            final String hostDesc = "[name: " + host.getName() + " (id:" + host.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName() + "]";
+            final String hostShortDesc = "Host " + host.getName() + " (id:" + host.getId() + ")";
+
+            final ResourceState resourceState = host.getResourceState();
+            if (resourceState == ResourceState.Disabled || resourceState == ResourceState.Maintenance || resourceState == ResourceState.ErrorInMaintenance) {
+                // If we are in this resourceState, no need to investigate or do anything.  AgentMonitor will handle when in these resourceStates
+                s_logger.info(hostShortDesc + " has disconnected with event " + event + ",  but is in Resource State of " + resourceState + ", so doing nothing");
+                return true;
+            }
+
             if (nextStatus == Status.Alert) {
-                /* OK, we are going to the bad status, let's see what happened */
-                s_logger.info("Investigating why host " + hostId + " has disconnected with event " + event);
+                /* Our next Agent transition state is Alert
+                 * Let's see if the host down or why we had this event
+                 */
+                s_logger.info("Investigating why host " + hostShortDesc + " has disconnected with event " + event);
 
 Review comment:
   👍 Good improvement, but although it is only (a comment and) a log statement, it is effectively an interface of the system. The ecosystem may query the logs for this text and no longer find the hostId, and so no longer be able to take mitigating actions. I'd rather see a less destructive change, like 'hostId + " (" + hostShortDesc + ") "'.

   We may get away with it, but it does require extensive testing by the whole community :/.
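
   To make the concern concrete, here is a minimal sketch, not code from the pull request. It compares the wording in the diff with the less destructive wording suggested above, and checks both against a pattern that an external log consumer might use. The class name, the scraper pattern, and the sample host name and event value are all assumptions made up for illustration, and hostId is assumed to be numeric.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; none of these names come from CloudStack itself.
class HostLogMessageSketch {

    // Hypothetical pattern an external log scraper might rely on (assumption):
    // it expects the numeric host id directly after "Investigating why host ".
    static final Pattern LEGACY_PREFIX =
            Pattern.compile("Investigating why host (\\d+)\\b");

    // Same construction as hostShortDesc in the diff.
    static String hostShortDesc(long hostId, String hostName) {
        return "Host " + hostName + " (id:" + hostId + ")";
    }

    // Wording from the diff: the id now appears only inside the description.
    static String prMessage(long hostId, String hostName, String event) {
        return "Investigating why host " + hostShortDesc(hostId, hostName)
                + " has disconnected with event " + event;
    }

    // Wording suggested in the review: the id stays first, the description follows.
    static String reviewMessage(long hostId, String hostName, String event) {
        return "Investigating why host " + hostId
                + " (" + hostShortDesc(hostId, hostName) + ")"
                + " has disconnected with event " + event;
    }

    // Returns the host id the hypothetical scraper would extract, or -1 if none.
    static long extractHostId(String logLine) {
        Matcher m = LEGACY_PREFIX.matcher(logLine);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }

    public static void main(String[] args) {
        // Sample values are made up for the demonstration.
        String pr = prMessage(42L, "kvm1", "AgentDisconnected");
        String review = reviewMessage(42L, "kvm1", "AgentDisconnected");
        System.out.println(pr + " -> " + extractHostId(pr));
        System.out.println(review + " -> " + extractHostId(review));
    }
}
```

   Under these assumptions the scraper still finds the id in the suggested wording but not in the wording from the diff. A consumer that matches the full original sentence would break with either wording, which is why the change would still need testing.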



> VM HA issues
> ------------
>
>                 Key: CLOUDSTACK-10246
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10246
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.)
>          Components: Management Server
>    Affects Versions: 4.11.0.0
>         Environment: My setup is CentOS 7 Management server with 3 CentOS 7 KVM HVs, NFS as primary and secondary storages.
>            Reporter: Nux
>            Priority: Major
>
> VM HA fails to kick in when one of the hypervisors goes down.
> It even fails to restart the system VMs which remain down along with the instances until the affected HV comes back online.
> When I crash or power off the HV the system marks it in the hosts list as "Alert" or "Disconnected" respectively. It should get changed to "Down" after that, but this never happens.
>
> I have tried various combinations of setups (Adv, Basic), none succeeded.
>
> My instances use HA enabled offerings.
> Management server DEBUG logs here:
> [http://tmp.nux.ro/CW4-vmhafail-411rc1.txt]

