cloudstack-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLOUDSTACK-10246) VM HA issues
Date Fri, 02 Mar 2018 08:20:01 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383336#comment-16383336 ]

ASF GitHub Bot commented on CLOUDSTACK-10246:
---------------------------------------------

DaanHoogland commented on a change in pull request #2474: CLOUDSTACK-10246 Fix Host HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171784055
 
 

 ##########
 File path: engine/orchestration/src/com/cloud/agent/manager/AgentManagerImpl.java
 ##########
 @@ -843,72 +846,103 @@ protected boolean handleDisconnectWithInvestigation(final AgentAttache attache,
                 s_logger.debug("Caught exception while getting agent's next status", ne);
             }
 
+            // For log and alert purposes later
+            final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
+            final HostPodVO podVO = _podDao.findById(host.getPodId());
+            final String hostDesc = "[name: " + host.getName() + " (id:" + host.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName() + "]";
+            final String hostShortDesc = "Host " + host.getName() + " (id:" + host.getId() + ")";
+
+            final ResourceState resourceState = host.getResourceState();
+            if (resourceState == ResourceState.Disabled || resourceState == ResourceState.Maintenance || resourceState == ResourceState.ErrorInMaintenance) {
+                // If we are in this resourceState, no need to investigate or do anything. AgentMonitor will handle when in these resourceStates
+                s_logger.info(hostShortDesc + " has disconnected with event " + event + ", but is in Resource State of " + resourceState + ", so doing nothing");
+                return true;
+            }
+
             if (nextStatus == Status.Alert) {
-                /* OK, we are going to the bad status, let's see what happened */
-                s_logger.info("Investigating why host " + hostId + " has disconnected with event " + event);
+                /* Our next Agent transition state is Alert
+                 * Let's see if the host down or why we had this event
+                 */
+                s_logger.info("Investigating why host " + hostShortDesc + " has disconnected with event " + event);
 
 Review comment:
   👍 Good improvement. However, although this is only a comment and a log statement, log output is effectively an interface of the system: tools in the ecosystem may search the logs for this text and, no longer finding the hostId, be unable to take mitigating action. I'd rather see a less destructive change, such as 'hostId + " (" + hostShortDesc + ") "'.
   
   We may get away with it but it does require extensive testing by the whole community :/.
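
   To illustrate the concern (this sketch is not part of the pull request): suppose an external tool extracts the host id with a regular expression anchored on the old message prefix. The class name, the pattern, and the sample host name below are hypothetical, chosen only for the example. With the message as changed in the PR the id is no longer found, while keeping hostId in front of the new description still matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names exist in CloudStack itself.
public class HostLogMessageSketch {
    // Assumed pattern a log-scraping tool might use against the old message.
    static final Pattern HOST_ID = Pattern.compile("Investigating why host (\\d+)");

    // Message text as changed in the PR: the numeric id no longer follows "host ".
    static String prMessage(String hostShortDesc, String event) {
        return "Investigating why host " + hostShortDesc + " has disconnected with event " + event;
    }

    // Message text along the lines suggested in the review: id first, description after.
    static String suggestedMessage(long hostId, String hostShortDesc, String event) {
        return "Investigating why host " + hostId + " (" + hostShortDesc + ") has disconnected with event " + event;
    }

    // Returns the captured host id, or null when the pattern does not match.
    static String extractHostId(String logLine) {
        Matcher m = HOST_ID.matcher(logLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String desc = "Host kvm1 (id:12)"; // same shape as hostShortDesc in the diff
        System.out.println(extractHostId(prMessage(desc, "AgentDisconnected")));
        System.out.println(extractHostId(suggestedMessage(12L, desc, "AgentDisconnected")));
    }
}
```

   A tool that matches the entire old line rather than just its prefix would still break under either format, which is presumably why the comment asks for broad testing.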

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> VM HA issues
> ------------
>
>                 Key: CLOUDSTACK-10246
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10246
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Management Server
>    Affects Versions: 4.11.0.0
>         Environment: My setup is CentOS 7 Management server with 3 CentOS 7 KVM HVs, NFS as primary and secondary storages.
>            Reporter: Nux
>            Priority: Major
>
> VM HA fails to kick in when one of the hypervisors goes down.
> It even fails to restart the system VMs which remain down along with the instances until the affected HV comes back online.
> When I crash or power off the HV the system marks it in the hosts list as "Alert" or "Disconnected" respectively. It should get changed to "Down" after that, but this never happens.
>  
> I have tried various combinations of setups (Adv, Basic), none succeeded.
>  
> My instances use HA enabled offerings.
> Management server DEBUG logs here:
> [http://tmp.nux.ro/CW4-vmhafail-411rc1.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
