cloudstack-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLOUDSTACK-9458) Some VMs are being stopped when agent is reconnecting
Date Thu, 18 Aug 2016 08:53:20 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426116#comment-15426116 ]

ASF GitHub Bot commented on CLOUDSTACK-9458:
--------------------------------------------

Github user marcaurele commented on the issue:

    https://github.com/apache/cloudstack/pull/1640
  
    I understand your point about the release, but we're not in an ideal world where everyone
runs the latest version. I do my best to look at the current code in CS to find possible
fixes for any bug or problem we encounter, or for changes we want to make in our version. I want
us to get back to the master version, but that's not the topic here, nor is it going to happen
in the next few weeks.
    
    Point 2 does not make sense to me. If the management server cannot determine the state
of the VMs, it could mark them as stopped (*even though I don't think it should*). But it should
not create a StopVM job, because that might trigger an actual stop of the VM if the agent
reconnects while the job is being picked up by the async job workers.
    If the VM is really down because the host has crashed, then the command is pointless,
and from a customer's point of view it would not make a difference. If the host is still up and
fine, but we have a network glitch, then requesting a stop of the VM is really bad from a
customer's point of view. By doing nothing, that is, by not requesting a stop, we would end up
in a better situation.
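
    As a rough illustration of the argument above, here is a minimal sketch in Java. It is
not CloudStack code: `DisconnectPolicySketch`, `Vm`, `Action` and both methods are hypothetical
names made up for this example, and it assumes the only input is whether HA is enabled on the VM.
The idea is that when the agent is unreachable, a VM without HA gets no job queued at all, so
nothing can be replayed against it when the agent reconnects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model: none of these types exist in CloudStack.
final class DisconnectPolicySketch {

    // The only two outcomes this sketch distinguishes.
    enum Action { SCHEDULE_HA_RESTART, LEAVE_UNTOUCHED }

    // Minimal stand-in for a VM record (assumed fields, for illustration only).
    static final class Vm {
        final String name;
        final boolean haEnabled;

        Vm(String name, boolean haEnabled) {
            this.name = name;
            this.haEnabled = haEnabled;
        }
    }

    // Decide what to do with one VM when the agent on its host is unreachable.
    // A VM without HA gets no job, so no stop can reach it after a reconnect.
    static Action onAgentUnreachable(Vm vm) {
        return vm.haEnabled ? Action.SCHEDULE_HA_RESTART : Action.LEAVE_UNTOUCHED;
    }

    // Names of the VMs on the host for which a job would be queued.
    static List<String> vmsToQueue(List<Vm> vmsOnHost) {
        List<String> queued = new ArrayList<>();
        for (Vm vm : vmsOnHost) {
            if (onAgentUnreachable(vm) == Action.SCHEDULE_HA_RESTART) {
                queued.add(vm.name);
            }
        }
        return queued;
    }
}
```

    With one HA-enabled VM and one VM without HA on the unreachable host, only the first one
ends up in the queue; the second is left for the agent's power report to sort out.
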
    
    Going back to which state should be set on the VM when the management server cannot determine
it: assuming that the VM is stopped because the management server cannot reach the agent is
just as incorrect as leaving it as it is (running, migrating, creating...). I'd rather create
a new state `UNKNOWN` for this special case, when the management server really does not know.
From a management point of view, it would also be easier to know that there are *ghost* VMs
somewhere for which the management server cannot determine the exact state, and that a proper
(*manual*) investigation should be done if the state stays like this, including for the billing
part.
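
    To make the `UNKNOWN` proposal concrete, here is a small sketch in Java. It is only an
assumption of how such a state could behave, not the existing CloudStack state machine:
`VmStateSketch`, its reduced `State` enum and both methods are hypothetical. An empty power
report stands for "the agent could not be reached".

```java
import java.util.Optional;

// Hypothetical sketch: a reduced state set with the proposed UNKNOWN value.
final class VmStateSketch {

    enum State { RUNNING, STOPPED, MIGRATING, UNKNOWN }

    // An empty report means the agent could not be reached. In that case the
    // state is neither assumed STOPPED nor left at its last value; it becomes
    // UNKNOWN until a real report arrives.
    static State resolve(Optional<State> powerReport) {
        return powerReport.orElse(State.UNKNOWN);
    }

    // VMs in UNKNOWN are the "ghost" VMs that call for a manual check.
    static boolean needsManualInvestigation(State state) {
        return state == State.UNKNOWN;
    }
}
```

    `resolve(Optional.empty())` yields `UNKNOWN`, while a report of `RUNNING` is passed through
unchanged, so a short network glitch never turns into an assumed stop.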


> Some VMs are being stopped when agent is reconnecting
> -----------------------------------------------------
>
>                 Key: CLOUDSTACK-9458
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-9458
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>            Reporter: Marc-Aurèle Brothier
>            Assignee: Marc-Aurèle Brothier
>
> If you lose the communication between the management server and one of the agents for
a few minutes, even though HA mode is not active, the HighAvailibilityManager kicks in and
starts to schedule VM restarts. Those tasks are inserted as async jobs in the DB, and if
the agent comes back online while the jobs are still in the async table, they are
pushed to the agent and shut down the VMs. Then, since HA is not active, the VMs are not restarted.
> The expected behavior, in my opinion, is that the VMs should not be restarted at all if
HA mode is not active on them, and that the agent should update the VM state with the power report.
> The bug lies in {{HighAvailibilityManagerImpl.scheduleRestartForVmsOnHost(final HostVO
host, boolean investigate)}}; a PR will follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
