cloudstack-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLOUDSTACK-10246) VM HA issues
Date Fri, 02 Mar 2018 08:56:00 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383366#comment-16383366 ]

ASF GitHub Bot commented on CLOUDSTACK-10246:
---------------------------------------------

Slair1 commented on a change in pull request #2474: CLOUDSTACK-10246 Fix Host HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171791837
 
 

 ##########
 File path: engine/orchestration/src/com/cloud/agent/manager/AgentManagerImpl.java
 ##########
 @@ -843,72 +846,103 @@ protected boolean handleDisconnectWithInvestigation(final AgentAttache
attache,
                 s_logger.debug("Caught exception while getting agent's next status", ne);
             }
 
+            // For log and alert purposes later
+            final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
+            final HostPodVO podVO = _podDao.findById(host.getPodId());
+            final String hostDesc = "[name: " + host.getName() + " (id:" + host.getId() +
"), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName() + "]";
+            final String hostShortDesc = "Host " + host.getName() + " (id:" + host.getId()
+ ")";
+
+            final ResourceState resourceState = host.getResourceState();
+            if (resourceState == ResourceState.Disabled || resourceState == ResourceState.Maintenance
|| resourceState == ResourceState.ErrorInMaintenance) {
+                // If we are in this resourceState, no need to investigate or do anything.
 AgentMonitor will handle when in these resourceStates
+                s_logger.info(hostShortDesc + " has disconnected with event " + event + ",
 but is in Resource State of " + resourceState + ", so doing nothing");
+                return true;
+            }
+
             if (nextStatus == Status.Alert) {
-                /* OK, we are going to the bad status, let's see what happened */
-                s_logger.info("Investigating why host " + hostId + " has disconnected with
event " + event);
+                /* Our next Agent transition state is Alert
+                 * Let's see if the host down or why we had this event
+                 */
+                s_logger.info("Investigating why host " + hostShortDesc + " has disconnected
with event " + event);
 
                 Status determinedState = investigate(attache);
                 // if state cannot be determined do nothing and bail out
                 if (determinedState == null) {
                     if ((System.currentTimeMillis() >> 10) - host.getLastPinged() >
AlertWait.value()) {
-                        s_logger.warn("Agent " + hostId + " state cannot be determined for
more than " + AlertWait + "(" + AlertWait.value() + ") seconds, will go to Alert state");
+                        s_logger.warn("State for " + hostShortDesc + " could not be determined
for more than " + AlertWait + "(" + AlertWait.value() + ") seconds, will go to Alert state");
                         determinedState = Status.Alert;
                     } else {
-                        s_logger.warn("Agent " + hostId + " state cannot be determined, do
nothing");
+                        s_logger.warn("State for " + hostShortDesc + " could not be determined,
doing nothing");
                         return false;
                     }
                 }
 
                 final Status currentStatus = host.getStatus();
-                s_logger.info("The agent from host " + hostId + " state determined is " +
determinedState);
+                s_logger.info("Status for " + hostShortDesc + " was " + currentStatus + ".
 Investigation determined the current state is " + determinedState);
 
-                if (determinedState == Status.Down) {
-                    final String message = "Host is down: " + host.getId() + "-" + host.getName()
+ ". Starting HA on the VMs";
-                    s_logger.error(message);
-                    if (host.getType() != Host.Type.SecondaryStorage && host.getType()
!= Host.Type.ConsoleProxy) {
-                        _alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, host.getDataCenterId(),
host.getPodId(), "Host down, " + host.getId(), message);
-                    }
-                    event = Status.Event.HostDown;
-                } else if (determinedState == Status.Up) {
-                    /* Got ping response from host, bring it back */
-                    s_logger.info("Agent is determined to be up and running");
+                if (determinedState == Status.Up) {
+                    // Got ping response from host, bring it back
+                    s_logger.info(hostShortDesc + " is up again");
                     agentStatusTransitTo(host, Status.Event.Ping, _nodeId);
-                    return false;
                 } else if (determinedState == Status.Disconnected) {
-                    s_logger.warn("Agent is disconnected but the host is still up: " + host.getId()
+ "-" + host.getName());
+                    // Investigation says host isn't down, just disconnected
                     if (currentStatus == Status.Disconnected) {
+                        // Last status was disconnected, only switch status if AlertWait
has passed
                         if ((System.currentTimeMillis() >> 10) - host.getLastPinged()
> AlertWait.value()) {
-                            s_logger.warn("Host " + host.getId() + " has been disconnected
past the wait time it should be disconnected.");
-                            event = Status.Event.WaitedTooLong;
+                                s_logger.error("The agent on " + hostShortDesc + " has been
disconnected longer than " + AlertWait + " (" + AlertWait.value() + " seconds). Setting event
to WaitedTooLong");
+                                if (host.getType() != Host.Type.SecondaryStorage &&
host.getType() != Host.Type.ConsoleProxy) {
+                                _alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST,
host.getDataCenterId(), host.getPodId(), hostShortDesc + " is in Alert status",
+                                                "The agent for host " + hostDesc + " has
been disconnected longer than " + AlertWait + " (" + AlertWait.value() + " seconds), host
will be put into Alert status.");
+                                }
+                                event = Status.Event.WaitedTooLong;  // Will put into Alert
status at transition
                         } else {
-                            s_logger.debug("Host " + host.getId() + " has been determined
to be disconnected but it hasn't passed the wait time yet.");
+                            // Host hasn't been disconnected long enough to change status
to Alert
+                            s_logger.warn(hostShortDesc + " has been disconnected for " +
((System.currentTimeMillis() >> 10) - host.getLastPinged()) + " seconds, but for less
than " + AlertWait + " (" + AlertWait.value() + " seconds).  No action taken.");
                             return false;
                         }
                     } else if (currentStatus == Status.Up) {
-                        final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
-                        final HostPodVO podVO = _podDao.findById(host.getPodId());
-                        final String hostDesc = "name: " + host.getName() + " (id:" + host.getId()
+ "), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName();
-                        if (host.getType() != Host.Type.SecondaryStorage && host.getType()
!= Host.Type.ConsoleProxy) {
-                            _alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, host.getDataCenterId(),
host.getPodId(), "Host disconnected, " + hostDesc,
-                                            "If the agent for host [" + hostDesc + "] is
not restarted within " + AlertWait + " seconds, host will go to Alert state");
-                        }
-                        event = Status.Event.AgentDisconnected;
+                        // Host was up, but now agent is disconnected
+                        // If host stays disconnected, it will be handled again next time
it is investigated
+                        s_logger.warn(hostShortDesc + " was up but is now disconnected. 
Setting event to AgentUnreachable.");
+                        event = Status.Event.AgentUnreachable;  // Will put into Disconnected
status at transition
+                        _alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, host.getDataCenterId(),
host.getPodId(), hostShortDesc + " is " + determinedState,
+                                        "The host " + hostDesc + " was " + currentStatus
+ ", but has now the host agent is in " + Status.Disconnected + ".  If the host is disconnected
longer than " + AlertWait + " (" + AlertWait.value() + " seconds), it will be put into Alert
status.");
+                    } else if (currentStatus == Status.Alert) {
+                        s_logger.error(hostShortDesc + " was in and is still in " + Status.Alert
+ ".  No action taken.");
+                        return false;
+                    }
+                    else {
+                        // If we are here, host was in another status, but next status is
alert so let's alert
+                        s_logger.error(hostShortDesc + " was in " + currentStatus + ", and
investigation determined it is " + Status.Disconnected);
+                        _alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, host.getDataCenterId(),
host.getPodId(), hostShortDesc + " is " + determinedState, "The host " + hostDesc + " was
in " + currentStatus + ", and investigation determined it is " + Status.Disconnected);
                     }
+                } else if (determinedState == Status.Down) {
+                    // Host was determined down - not just disconnected
+                    s_logger.error(hostShortDesc + " is down.  Setting event to HostDown
and will start HA on the VMs");
+                    if (host.getType() != Host.Type.SecondaryStorage && host.getType()
!= Host.Type.ConsoleProxy) {
+                        _alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, host.getDataCenterId(),
host.getPodId(), hostShortDesc + " is down", "Host " + hostDesc + " is down. Will start HA
on the VMs");
+                    }
+                    else {
+                        _alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, host.getDataCenterId(),
host.getPodId(), hostShortDesc + " is down", "Host " + hostDesc + " is down.");
+                    }
+                    event = Status.Event.HostDown;
+                    removeAgent = true;
+                } else if (determinedState == Status.Alert){
+                    // This is likely a Console Proxy or Secondary Storage VM
+                    s_logger.error("Investigation found " + hostShortDesc + " in state: "
+ determinedState);
+                    _alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, host.getDataCenterId(),
host.getPodId(), "Investigation found " + hostShortDesc + " in " + determinedState, "Investigation
found " + hostShortDesc + " in " + determinedState + " state.  Event is " + event);
                 } else {
-                    // if we end up here we are in alert state, send an alert
-                    final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
-                    final HostPodVO podVO = _podDao.findById(host.getPodId());
-                    final String podName = podVO != null ? podVO.getName() : "NO POD";
-                    final String hostDesc = "name: " + host.getName() + " (id:" + host.getId()
+ "), availability zone: " + dcVO.getName() + ", pod: " + podName;
-                    _alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, host.getDataCenterId(),
host.getPodId(), "Host in ALERT state, " + hostDesc,
-                                    "In availability zone " + host.getDataCenterId() + ",
host is in alert state: " + host.getId() + "-" + host.getName());
+                    // Determined state was not Up, Disconnected, Down, or Alert.  To catch
anything else create alert
+                    s_logger.error("Investigation found " + hostShortDesc + " in unhandled
state: " + determinedState);
+                    _alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, host.getDataCenterId(),
host.getPodId(), "Investigation found " + hostShortDesc + " in " + determinedState, "Investigation
found " + hostShortDesc + " in unhandled state: " + determinedState);
                 }
             } else {
-                s_logger.debug("The next status of agent " + host.getId() + " is not Alert,
no need to investigate what happened");
+                s_logger.info("The next status of host " + hostShortDesc + " is " + nextStatus
+ ", no need to investigate what happened");
             }
         }
-        handleDisconnectWithoutInvestigation(attache, event, true, true);
-        host = _hostDao.findById(hostId); // Maybe the host magically reappeared?
+
+        handleDisconnectWithoutInvestigation(attache, event, true, removeAgent);
+        host = _hostDao.findById(hostId); // We may have transitioned the status - refresh
 
 Review comment:
   Yea, that sounds good
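
The branch structure in the hunk above can be condensed into a small, self-contained sketch. It is not the PR's code: the class `DisconnectDecisionSketch`, the `Decision` type, and the `decide` and `approxSeconds` helpers are hypothetical names introduced only for illustration, the enums list just the values that appear in the diff, and `removeAgent` is assumed to start as false because its declaration lies outside the hunk. Logging, alerts, and the status transition on `Status.Up` are left out.

```java
public class DisconnectDecisionSketch {
    // Only the values that appear in the diff; the real enums are larger.
    enum Status { Up, Disconnected, Down, Alert }
    enum Event { AgentDisconnected, AgentUnreachable, WaitedTooLong, HostDown }

    /** Hypothetical result type, not part of the PR. */
    static final class Decision {
        final Event event;         // event handed on to the disconnect handler
        final boolean removeAgent; // set to true only in the Down branch
        final boolean proceed;     // false models the early "return false" paths

        Decision(Event event, boolean removeAgent, boolean proceed) {
            this.event = event;
            this.removeAgent = removeAgent;
            this.proceed = proceed;
        }
    }

    // The diff approximates seconds with a right shift by 10 (divide by 1024).
    static long approxSeconds(long millis) {
        return millis >> 10;
    }

    static Decision decide(Status determined, Status current, Event incoming,
                           long nowMillis, long lastPingedSeconds, long alertWaitSeconds) {
        boolean waitedTooLong =
                approxSeconds(nowMillis) - lastPingedSeconds > alertWaitSeconds;

        if (determined == null) {
            if (!waitedTooLong) {
                return new Decision(incoming, false, false); // state unknown, do nothing
            }
            determined = Status.Alert; // unknown for too long, treat as Alert
        }
        if (determined == Status.Disconnected) {
            if (current == Status.Disconnected) {
                return waitedTooLong
                        ? new Decision(Event.WaitedTooLong, false, true)
                        : new Decision(incoming, false, false);
            }
            if (current == Status.Up) {
                return new Decision(Event.AgentUnreachable, false, true);
            }
            if (current == Status.Alert) {
                return new Decision(incoming, false, false); // already in Alert
            }
            return new Decision(incoming, false, true); // other statuses: alert only
        }
        if (determined == Status.Down) {
            return new Decision(Event.HostDown, true, true);
        }
        // Up, Alert, and any unhandled state keep the incoming event.
        return new Decision(incoming, false, true);
    }

    public static void main(String[] args) {
        Decision d = decide(Status.Down, Status.Up, Event.AgentDisconnected,
                2_048_000L, 1_000L, 1_800L);
        System.out.println(d.event + " removeAgent=" + d.removeAgent
                + " proceed=" + d.proceed);
    }
}
```

Returning a value rather than mutating `event` and `removeAgent` in place is a choice made for this sketch so the branches can be checked in isolation; the PR itself updates local variables and then calls `handleDisconnectWithoutInvestigation`.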

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> VM HA issues
> ------------
>
>                 Key: CLOUDSTACK-10246
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10246
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Management Server
>    Affects Versions: 4.11.0.0
>         Environment: My setup is CentOS 7 Management server with 3 CentOS 7 KVM HVs, NFS as primary and secondary storages.
>            Reporter: Nux
>            Priority: Major
>
> VM HA fails to kick in when one of the hypervisors goes down.
> It even fails to restart the system VMs which remain down along with the instances until the affected HV comes back online.
> When I crash or power off the HV the system marks it in the hosts list as "Alert" or "Disconnected" respectively. It should get changed to "Down" after that, but this never happens.
>  
> I have tried various combinations of setups (Adv, Basic), none succeeded.
>  
> My instances use HA enabled offerings.
> Management server DEBUG logs here:
> [http://tmp.nux.ro/CW4-vmhafail-411rc1.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
