cloudstack-users mailing list archives

From Dave Dunaway <>
Subject Re: reconnecting to host in alert state - cloud cocked up
Date Wed, 16 Jan 2013 15:56:46 GMT
I had a somewhat similar case last week, where my VMware ESXi hypervisor and
vCenter got disconnected. Cloudstack refused to work with the host for
controlling the running VMs on it (it could talk to vCenter just fine,
but any communication with the ESXi host would result in a timeout to that
host). The host in Cloudstack went into the 'Alert' state and we could not do
much of anything with it.

What we did was to cheat a bit and set the host as OK in the cloud.hosts DB
table. Then we could do things like put the host into maintenance mode. At that
point Cloudstack started to shut down machines on my ESXi host (it was the
only one in the cluster, so we sort of expected that behavior), so
Cloudstack could obviously talk to vCenter and the ESXi host and interact
with the VMs to do this... so why the timeouts before?
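
For illustration only, here is a minimal sketch of that kind of guarded status override. It runs against an in-memory SQLite database as a stand-in for the management server's MySQL database. The table name `host`, the column `status`, and the state values 'Alert' and 'Up' are assumptions made for this example, as are the helper `force_host_status` and the host names; check the actual schema, and take a backup, before touching a real database.

```python
import sqlite3

# Hypothetical stand-in for the CloudStack management database. The real one
# is MySQL; the table, column, and state names used here are assumptions.
def force_host_status(conn, host_id, expected="Alert", new_status="Up"):
    """Flip one host's status, but only if it is currently in `expected`."""
    cur = conn.execute(
        "UPDATE host SET status = ? WHERE id = ? AND status = ?",
        (new_status, host_id, expected),
    )
    conn.commit()
    return cur.rowcount  # 1 if the row was changed, 0 otherwise


# Build a tiny example table with one host in 'Alert' and one healthy host.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE host (id INTEGER PRIMARY KEY, name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO host (id, name, status) VALUES (?, ?, ?)",
    [(1, "esxi-01", "Alert"), (2, "esxi-02", "Up")],
)

changed = force_host_status(conn, 1)
print(changed, conn.execute("SELECT status FROM host WHERE id = 1").fetchone()[0])
```

Matching on the expected current state in the WHERE clause keeps the update from touching a host that has already recovered or moved to another state in the meantime.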

While your case is different in a few ways, what I would like to bring
forward is that when hosts do 'disconnect' in Cloudstack, Cloudstack itself
does not seem to handle the recovery of the host gracefully.

It's always a struggle to recover the host, and in production environments
(this happened to a test environment in my case) it is totally unacceptable
for Cloudstack to sit there doing nothing, with VMs in limbo, rather than
recover the host.

I would suggest that these Cloudstack-to-hypervisor failure states be
tested further and made more resilient.

On Wed, Jan 16, 2013 at 10:12 AM, Nik Martin <> wrote:

> Ok, this is a new thread centered on a serious problem in my 3.02 CS
> cloud, running Xenserver 6.02 hosts.  Here is what has transpired so far:
> 1: user reports console proxy not available
> 2: confirm console proxy not available, issue reboot via cloudstack UI
> 3: CS reports VM booted ok, still unavailable
> 4: tried to migrate to different host, VM stuck in migrating state
> 5: Log in to host, list_domains command does not show VM , but shows a
> domain in this state:
> 117 | deadbeef-dead-beef-dead-beef00000075 | DS
> which is a pretty bad sign that the VM is hung pretty badly.
> 6: attempt to destroy domain according to Citrix Support article:
> /opt/xensource/debug/destroy_domain -domid 117
> 7: command hangs
> 8: I then restart xe api toolstack, it appears to restart fine. I should
> note that ALL vms are on this host via the "first_fit" vm provisioning
> algorithm
> 9: I attempt to start migrating VMs to two other available hosts in
> preparation for a hard reboot of host
> 10: migrating VMs fails, and host is now in alert state in CS, and CS log
> states that host is unavailable. Force reconnect fails.
> So, here I am, in a production environment with a scenario that the whole
> premise of cloud based computing is specifically designed to address, and
> it is the root cause of the issue it is intended to prevent.
> Do I have any other options to prevent downtime? I have exhausted
> everything I know to do. I have already scheduled a maintenance window, and
> fudged the truth to my customers stating that there should be no downtime
> during this window, which I have 0 faith will actually be true.
> --
> Regards,
> Nik
> Nik Martin
> nfina Technologies, Inc.
> + x1003
> Relentless Reliability
