cloudstack-issues mailing list archives

From "Koushik Das (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLOUDSTACK-3954) HA with Security Groups and ping disabled will cause split-brain
Date Mon, 05 Aug 2013 18:26:48 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729759#comment-13729759
] 

Koushik Das commented on CLOUDSTACK-3954:
-----------------------------------------

Hi Lennert,

Thanks for posting the logs and for the good analysis.

Regarding the fencer reporting that the VM was successfully fenced:
Each KVM host has a heartbeat file that is updated at regular intervals by the agent
running on that host. The FenceCommand is sent to peer hosts in the cluster that are in
the 'Up' state (based on the logs, there seems to be another host with id = 1). The fence
command checks whether the heartbeat file is still being updated regularly. If there has
been no update for more than 1 minute (hardcoded), the host is considered dead and the
FenceCommand returns success. Based on that result, CS assumes that the VM has been
successfully fenced off.

The fencing logic is not right in this case: the KVM agent on the host is dead, so the
heartbeat file is no longer updated, even though the host and its VMs are still running.
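
To make the failure mode concrete, here is a minimal sketch of the check as described
in this comment. It is not the CloudStack code: the names KvmHost, agent_tick and
fence_reports_success are made up for illustration, and the heartbeat file is modelled
as a single timestamp. Only the 1 minute threshold comes from the description above.

```python
from dataclasses import dataclass

# "More than 1 minute" without an update, as described in the comment.
HEARTBEAT_TIMEOUT_SECONDS = 60


@dataclass
class KvmHost:
    """Hypothetical model of a KVM host; not a CloudStack class."""
    agent_running: bool = True
    vms_running: bool = True
    last_heartbeat: float = 0.0  # time of the last heartbeat write, in seconds

    def agent_tick(self, now: float) -> None:
        # Only the agent writes the heartbeat. If the agent is dead,
        # the timestamp goes stale even though the VMs keep running.
        if self.agent_running:
            self.last_heartbeat = now


def fence_reports_success(host: KvmHost, now: float,
                          timeout: float = HEARTBEAT_TIMEOUT_SECONDS) -> bool:
    # The peer host only looks at the heartbeat age; it never checks
    # whether the VMs on the target host are actually stopped.
    return (now - host.last_heartbeat) > timeout


host = KvmHost()
host.agent_tick(now=100.0)   # agent alive, heartbeat written at t=100
host.agent_running = False   # agent crashes; host and VMs stay up
host.agent_tick(now=130.0)   # no-op: nothing writes the heartbeat any more

print(fence_reports_success(host, now=150.0))  # False: heartbeat is 50s old
print(fence_reports_success(host, now=161.0))  # True: heartbeat is 61s old
print(host.vms_running)                        # True: VMs were never stopped
```

In this model the fence check reports success 61 seconds after the last heartbeat
while vms_running is still True, which is the state in which HA would start a second
copy of each VM on another host.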

I don't think this code has changed recently, so the problem was most probably there
in earlier versions as well. One option you have suggested is to not do HA if the host is
alive. When you say 'hypervisor shows any sign of life', how do you intend to determine
this? Is it by pinging the host? This option also has its problems: what if the host is
still alive but does not respond to ping? I guess the possible ways to fix this need to be
discussed more broadly on the dev list.


                
> HA with Security Groups and ping disabled will cause split-brain
> ----------------------------------------------------------------
>
>                 Key: CLOUDSTACK-3954
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3954
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: KVM
>    Affects Versions: 4.1.0
>         Environment: Tested this with CS 4.1 on Ubuntu, but will probably exist in other versions
>            Reporter: Lennert den Teuling
>            Assignee: Koushik Das
>            Priority: Critical
>             Fix For: 4.2.0
>
>
> We found out that running CS 4.1 on KVM with Security Groups enabled and ping disabled
> (the default) will cause a split-brain when the agent crashes.
> How to reproduce:
> 1. Set up a Basic Zone with SG enabled
> 2. Create one or multiple HA-enabled VMs with a security group which does not allow
>    ping (the default).
> 3. Kill the agent on one of the hosts
> When you do this, the HA component on the management server will restart all VMs on another
> node, even when they are running and the VM host is still pingable. This will likely corrupt
> all VMs on the host where the agent was stopped/killed.
> We had some issues with libvirt causing the agent to disconnect. Luckily some VMs allowed
> ping, so nothing bad happened.
> Temporary fix:
> Ensure at least one of the running VMs on each host allows ping, so the HA manager will
> be able to ping it and will not HA the host.
> I'm not sure yet why this happens, but wanted to file this bug so people can take the
> necessary precautions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
