cloudstack-issues mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down
Date Mon, 15 Sep 2014 15:42:34 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134023#comment-14134023 ]

ASF subversion and git services commented on CLOUDSTACK-7184:
-------------------------------------------------------------

Commit b0641a7d279734970577a3a87940abd030a6a8c2 in cloudstack's branch refs/heads/hotfix/4.4/CLOUDSTACK-7184
from [~dahn]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=b0641a7 ]

CLOUDSTACK-7184 timeout configuration value for host check

> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-7184
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Hypervisor Controller, Management Server, XenServer
>    Affects Versions: 4.3.0, 4.4.0, 4.5.0
>         Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>            Reporter: Remi Bergsma
>            Assignee: Daan Hoogland
>            Priority: Blocker
>
> A hypervisor was isolated for 30 seconds due to a network issue. CloudStack detected this, marked the host as down, and immediately started HA. Just 18 seconds later the hypervisor returned and we ended up with 5 VMs running on two hypervisors at the same time.
> This, of course, resulted in file system corruption and the loss of the VMs. One side of the story is why XenServer allowed this to happen (will not bother you with this one). The CloudStack side of the story: HA should only start after at least xen.heartbeat.interval seconds. If the host is down long enough, the Xen heartbeat script will fence the hypervisor and prevent corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .....
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on the VMs
> .....
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up:     2014-07-25  05:03:49,655
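
The timing rule requested in the issue can be illustrated with a short Python sketch. It is not CloudStack code and does not reflect the referenced commit. The function names (`parse_log_time`, `should_start_ha`) and the 60-second heartbeat interval are assumptions made for illustration. Only the two timestamps come from the log excerpt above.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the quoted management server log lines.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Assumed value for xen.heartbeat.interval, in seconds (illustrative only).
ASSUMED_HEARTBEAT_INTERVAL_S = 60


def parse_log_time(stamp: str) -> datetime:
    """Parse a timestamp such as '2014-07-25 05:03:31,920'."""
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def should_start_ha(marked_down_at: datetime, now: datetime,
                    heartbeat_interval_s: float) -> bool:
    """Hypothetical check: allow HA only once the host has been down
    for at least the heartbeat interval."""
    return now - marked_down_at >= timedelta(seconds=heartbeat_interval_s)


# Timestamps taken from the log excerpt in the issue description.
marked_down = parse_log_time("2014-07-25 05:03:31,920")
marked_up = parse_log_time("2014-07-25 05:03:49,655")

# Time between "host is down" and the host coming back, in seconds.
outage_s = (marked_up - marked_down).total_seconds()
print(f"host was marked down for {outage_s:.3f} s")

# Under the assumed interval, HA would not yet have been allowed
# at the moment the host returned.
print(should_start_ha(marked_down, marked_up, ASSUMED_HEARTBEAT_INTERVAL_S))
```

With a check like this, the host's return about 18 seconds after being marked down would fall inside the waiting period, so no VM would have been started on a second hypervisor.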



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
