cloudstack-issues mailing list archives

From "Gerard Lynch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLOUDSTACK-3421) When hypervisor is down, no HA occurs with log output "Agent state cannot be determined, do nothing"
Date Thu, 25 Jul 2013 08:05:50 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719365#comment-13719365 ]

Gerard Lynch commented on CLOUDSTACK-3421:
------------------------------------------

Closing this issue as it's a duplicate of 3535, which has all the visibility.
                
> When hypervisor is down, no HA occurs with log output "Agent state cannot be determined,
do nothing"
> ----------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-3421
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3421
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: KVM, Management Server
>    Affects Versions: 4.1.0
>         Environment: CentOS 6.4 minimal install
> Libvirt, KVM/Qemu
> CloudStack 4.1
> GlusterFS 3.2, replicated+distributed as primary storage via Shared Mount Point
> 3 physical servers
> * 1 management server, running NFS secondary storage
> ** 1 nic, management+storage
> * 2 hypervisor nodes, running glusterfs-server 
> ** 4x nic, management+storage, public, guest, gluster peering
> * Advanced zone
> * KVM
> * 4 networks: 
>  eth0: cloudbr0: management+secondary storage, 
>  eth2: cloudbr1: public
>  eth3: cloudbr2: guest
>  eth1: gluster peering
> * Shared Mount Point
> * custom network offering with redundant routers enabled
> * global settings tweaked to increase speed of identifying down state
> ** ping.interval: 10sec
>            Reporter: Gerard Lynch
>            Priority: Critical
>             Fix For: 4.1.1, 4.2.0, Future
>
>         Attachments: catalina_management-server.zip
>
>
> We wanted to test CloudStack's HA capabilities by simulating outages to find out how
long it would take to recover.  One of the tests was simulating loss of a hypervisor node
by shutting it down.   When we tested this, we found that CloudStack failed to bring up any
of the VMs (System or Instance), which were on the down node, until the node was powered back
up and reconnected.
> In the logs, we see repeating occurrences of:
> INFO  [utils.exception.CSExceptionErrorCode] (AgentTaskPool-11:) Could not find exception:
com.cloud.exception.OperationTimedoutException in error code list for exceptions
> INFO  [utils.exception.CSExceptionErrorCode] (AgentTaskPool-10:) Could not find exception:
com.cloud.exception.OperationTimedoutException in error code list for exceptions
> WARN  [agent.manager.AgentAttache] (AgentTaskPool-11:) Seq 14-660013135: Timed out on
Seq 14-660013135:  { Cmd , MgmtId: 93515041483, via: 14, Ver: v1, Flags: 100011, [{"CheckHealthCommand":{"wait":50}}]
}
> WARN  [agent.manager.AgentAttache] (AgentTaskPool-10:) Seq 15-1097531400: Timed out on
Seq 15-1097531400:  { Cmd , MgmtId: 93515041483, via: 15, Ver: v1, Flags: 100011, [{"CheckHealthCommand":{"wait":50}}]
}
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-11:) Operation timed out: Commands
660013135 to Host 14 timed out after 100
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-10:) Operation timed out: Commands
1097531400 to Host 15 timed out after 100
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-11:) Agent state cannot be determined,
do nothing
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-10:) Agent state cannot be determined,
do nothing
> To reproduce: 
> 1. Build the environment as detailed above
> 2. Register an ISO
> 3. Create a new guest network using the custom network offering (that offers redundant
routers)
> 4. Provision an instance
> 5. Ensure the system VMs and instance are on the first hypervisor node
> 6. Shut down the first hypervisor node (or pull the plug)
> Expected result:
>   All system VMs and instance(s) should be brought up on the 2nd hypervisor node.
> Actual result:
>   We see the first hypervisor node marked "disconnected."
>   All System VMs and the Instance are still marked "Running", however ping to any of
them fails. 
>   Ping to the redundant router on the 2nd hypervisor node is still working.
>   We see in the logs 
>   "INFO  [utils.exception.CSExceptionErrorCode] (AgentTaskPool-11:) Could not find exception:
com.cloud.exception.OperationTimedoutException in error code list for exceptions"
>   Followed by
>   "Agent state cannot be determined, do nothing"
> Searching for "Cloudstack Agent state cannot be determined, do nothing" led to: CLOUDSTACK-803
- https://reviews.apache.org/r/8853/
> This caused me some concern because, if I read the logic in the ticket correctly, the
management server will not perform any HA actions if it's unable to determine the state
of a hypervisor node.  In the scenario above, it's not a loss of connectivity but an actual
outage on the hypervisor, so I'd rather like HA to occur.  Split brain is a concern, but
I think that something along the lines of "if hypervisor can't see management or gateway,
stop instances" is more relevant than "do nothing".
> I'm hoping this is something really obvious and simple to resolve, because otherwise
this is a pretty serious issue as currently any accidental shutdown, or hardware fault will
cause a continuous outage requiring manual action to resolve.
> Thanks
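
The rule proposed in the description ("if hypervisor can't see management or gateway, stop instances") could be sketched roughly as below. This is only an illustration, not CloudStack code: the names `Action` and `host_fencing_action` are hypothetical, and the sketch assumes the rule means the host stops its instances only when it can reach neither the management server nor the gateway.

```python
from enum import Enum


class Action(Enum):
    KEEP_RUNNING = "keep_running"
    STOP_INSTANCES = "stop_instances"


def host_fencing_action(can_reach_management: bool, can_reach_gateway: bool) -> Action:
    """Hypothetical host-side rule, not part of CloudStack.

    Assumption: the host fences itself (stops its instances) only when it
    can reach neither the management server nor the gateway.
    """
    if not can_reach_management and not can_reach_gateway:
        return Action.STOP_INSTANCES
    return Action.KEEP_RUNNING


if __name__ == "__main__":
    # An isolated host stops its instances so they can be started elsewhere
    # without two copies running at once.
    print(host_fencing_action(False, False))
    # A host that only lost the management server keeps its instances running.
    print(host_fencing_action(False, True))
```

A rule like this only covers the hypervisor side. The management server would still need some way to decide that an unreachable host has either powered off or fenced itself before restarting its VMs on another node.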

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
