cloudstack-issues mailing list archives

From "Brenn Oosterbaan (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down
Date Wed, 10 Sep 2014 08:10:28 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128206#comment-14128206
] 

Brenn Oosterbaan edited comment on CLOUDSTACK-7184 at 9/10/14 8:09 AM:
-----------------------------------------------------------------------

"I've seen similar with KVM - I'm not sure this is necessarily tied to Xen? I'd suggest that
possibly CS be a little more thorough before deciding a VM is down...maybe via channels other
than the agent/VR?"

John is right on the money here. Although the patch committed by Daan makes it possible to
specify a check interval for the Xen storage heartbeat script (instead of using the default
of 5 seconds), it does not address the root cause of this issue.

There are two mechanisms at work here: the Xen heartbeat script, which checks whether storage
is reachable from a specific hypervisor, and CloudStack itself, which determines whether a
hypervisor is up or down.

When we set the Xen heartbeat interval to 180 seconds, we basically said: it is OK for VMs
living on a hypervisor to 'hang' for 180 seconds during storage fail-overs or other issues.
CloudStack, however, has its own checks to determine whether a hypervisor is down, and those
checks are not aligned with the Xen heartbeat interval. So even though we decided that 180
seconds of unavailability is fine, CloudStack tries to connect to the hypervisor 3 times
(in ~30 seconds), then declares it down and starts the VMs on another hypervisor.
That is the issue/bug Remi meant to identify when filing this ticket.
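To make the mismatch above concrete, here is a minimal sketch of the two timelines. The parameter names and the 3-attempts/~10-second ping cadence are illustrative assumptions based on the numbers in this thread, not actual CloudStack internals:

```python
# Sketch of the timing mismatch described above.
# All names and values are illustrative, not real CloudStack settings.

def seconds_until_marked_down(ping_attempts: int, ping_interval: int) -> int:
    """Time before the management server declares a host down."""
    return ping_attempts * ping_interval

def seconds_until_fenced(heartbeat_interval: int) -> int:
    """Time before the Xen storage heartbeat script self-fences the host."""
    return heartbeat_interval

marked_down = seconds_until_marked_down(ping_attempts=3, ping_interval=10)  # ~30 s
fenced = seconds_until_fenced(heartbeat_interval=180)                       # 180 s

# Any HA start inside this window risks the same VM running twice,
# because the "down" host may still be alive and unfenced:
unsafe_window = fenced - marked_down
print(unsafe_window)  # 150
```

The point is simply that CloudStack gives up roughly 150 seconds before the heartbeat script would actually fence the host.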

I personally feel there should be two additional options: hypervisor.heartbeat.interval and
hypervisor.heartbeat.max_retry.
This would let us, for instance, set the interval to 15 seconds and max_retry to 12, which
adds up to 180 seconds as well.
Since the default heartbeat timeout is 60 seconds, I would set the defaults for these options
to a combination that allows for 60 seconds as well. Otherwise you can never be sure the
hypervisor itself has actually rebooted, and VM corruption could still take place.
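The safety condition behind the proposal can be sketched in a few lines. The option names are the hypothetical ones proposed above, not existing CloudStack globals:

```python
# Hedged sketch of the proposed rule: HA may only start once the
# retry schedule covers the storage heartbeat timeout, i.e. once the
# host is guaranteed to have self-fenced.

def ha_safe_to_start(interval_s: int, max_retry: int, heartbeat_timeout_s: int) -> bool:
    """True if interval_s * max_retry waits at least as long as the fence timeout."""
    return interval_s * max_retry >= heartbeat_timeout_s

# The example from the comment: 15 s x 12 retries = 180 s.
assert ha_safe_to_start(15, 12, 180)

# A default combination matching the 60 s default heartbeat timeout,
# e.g. 15 s x 4 retries (values chosen for illustration only):
assert ha_safe_to_start(15, 4, 60)

# Today's behavior (~3 tries in ~30 s) against a 180 s heartbeat fails the rule:
assert not ha_safe_to_start(10, 3, 180)
```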

regards,

Brenn



> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when
host is marked down
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-7184
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Hypervisor Controller, Management Server, XenServer
>    Affects Versions: 4.3.0, 4.4.0, 4.5.0
>         Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>            Reporter: Remi Bergsma
>            Assignee: Daan Hoogland
>            Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack discovered this,
marked the host as down, and immediately started HA. Just 18 seconds later the hypervisor
returned, and we ended up with 5 VMs that were running on two hypervisors at the same time.

> This, of course, resulted in file system corruption and the loss of the VMs. One side
of the story is why XenServer allowed this to happen (we will not bother you with that one).
The CloudStack side of the story: HA should only start after at least xen.heartbeat.interval
seconds. If the host is down long enough, the Xen heartbeat script will fence the hypervisor
and prevent corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] (DirectAgent-122:ctx-690badc5)
Unable to get current status on 505(mccpvmXX)
> .....
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-11b9af3e)
Host is down: 505-mccpvmXX.  Starting HA on the VMs
> .....
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager Timer:ctx-0e00979c)
Transition:[Resource state = Enabled, Agent event = AgentDisconnected, Host id = 505, name
= mccpvmXX]
> cs marks host down: 2014-07-25 05:03:31,920
> cs marks host up:   2014-07-25 05:03:49,655
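The outage length follows directly from the two quoted timestamps; a quick check (standard library only, using the log's own timestamp format) confirms it was roughly 18 seconds, far below the 180-second heartbeat interval:

```python
from datetime import datetime

# Timestamp format used in the quoted CloudStack logs.
FMT = "%Y-%m-%d %H:%M:%S,%f"

marked_down = datetime.strptime("2014-07-25 05:03:31,920", FMT)
marked_up = datetime.strptime("2014-07-25 05:03:49,655", FMT)

outage = (marked_up - marked_down).total_seconds()
print(round(outage))  # 18
```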



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
