cloudstack-users mailing list archives

From Chiradeep Vittal <>
Subject Re: Help! After network outage, can't start System VMs; focused debug info attached
Date Tue, 17 Sep 2013 00:19:05 GMT
Attachments are stripped. Can you paste (say at

From: Matt Foley <<>>
Reply-To: "<>" <<>>
Date: Monday, September 16, 2013 4:58 PM
To: "<>" <<>>
Subject: Help! After network outage, can't start System VMs; focused debug info attached

We had a planned network outage this weekend, which inadvertently made the NFS
Shared Primary Storage (used by System VMs) unavailable for a day and a half.  (Guest VMs
use local storage only, but System VMs use shared storage only.)  CloudStack was not brought
down prior to the outage.

After the network came back, we gracefully brought down all services, including cloudstack-management,
mysql, and NFS, then rebooted all servers in the cluster and the NFS server (to make
sure there were no stale file handles), then brought the services up in the appropriate order.  We also
checked mysql for table corruption and found none.  We confirmed that the NFS volumes are mountable
from all hosts, and in fact Shared Primary Storage is being mounted by cloudstack on the hosts
as usual, under /mnt/<uuid>.
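For anyone repeating this check on each host, here is a minimal Python sketch that reports whether a path is really a mount point and lists what is visible under it.  The function name check_pool_mount is made up for illustration, and the mount path has to be supplied by the caller.  Whether template files show up as plain files under the mount depends on the hypervisor, so the listing is only a starting point.

```python
import os


def check_pool_mount(mount_path):
    """Report whether mount_path is a mount point and list its entries.

    Hypothetical helper for illustration only; it is not part of CloudStack.
    """
    # os.path.ismount returns False for paths that do not exist.
    is_mounted = os.path.ismount(mount_path)
    # List the directory only if it exists, so a missing mount does not raise.
    entries = sorted(os.listdir(mount_path)) if os.path.isdir(mount_path) else []
    return {"path": mount_path, "mounted": is_mounted, "entries": entries}


if __name__ == "__main__":
    import sys

    # Usage: python check_mount.py /mnt/<pool-uuid>
    if len(sys.argv) > 1:
        result = check_pool_mount(sys.argv[1])
        print("mounted:", result["mounted"])
        for name in result["entries"]:
            print("  ", name)
```

A directory that exists but reports mounted: False would mean the host is looking at an empty local directory rather than the NFS export.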

Nevertheless, when we try to bring up the cluster, the System VMs fail to start, with errors
"InsufficientServerCapacityException: Unable to create a deployment for VM".  The cause is
not really insufficient capacity, as actual resource usage is tiny; these error messages
mask the real failure, which is in creating the primary storage volume for the System VMs.

Digging into management-server.log, the core issue seems to be the ~160 line snippet from
the log attached to this message as cloudstack_debug_2013.09.16.log.  The only Shared Primary
Storage pool is pool 201, named "cs-primary".  It is mounted on all hosts as /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9,
which is its uuid.  The log shows the management server correctly identifying a particular
host as being able to access pool 201, then trying to allocate a primary storage volume using
the template with uuid f23a16e7-b628-429e-83e1-698935588465.  It fails, but I cannot tell
why.  I suspect its claim that "Template 3 has already been downloaded to pool 201" is false,
but I don't know how to check this (or how to fix it if it is wrong).
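Since attachments do not survive on the list, a short script can pull the relevant lines out of management-server.log for pasting.  This is a minimal Python sketch, assuming the log is plain text with one entry per line; the function name grep_with_context and the example pattern are illustrative only, and the log path is passed on the command line rather than assumed.

```python
import re
import sys
from collections import deque


def grep_with_context(lines, pattern, context=2):
    """Return lines matching `pattern`, plus `context` lines before and after.

    Hypothetical helper for illustration; it mimics `grep -C` without duplicates.
    """
    regex = re.compile(pattern)
    before = deque(maxlen=context)  # rolling buffer of recent non-emitted lines
    after_remaining = 0             # trailing context lines still to emit
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if regex.search(line):
            out.extend(before)      # flush leading context
            before.clear()
            out.append(line)
            after_remaining = context
        elif after_remaining > 0:
            out.append(line)
            after_remaining -= 1
        else:
            before.append(line)
    return out


if __name__ == "__main__":
    # Usage: python logctx.py <path-to-management-server.log> "pool 201|Template 3"
    if len(sys.argv) > 2:
        with open(sys.argv[1], errors="replace") as fh:
            for hit in grep_with_context(fh, sys.argv[2], context=3):
                print(hit)
```

Matching on both the pool id and the template id keeps the output short enough to paste while still showing what the allocator did just before and after each mention.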

Any guidance for further debugging or fixing this would be GREATLY appreciated.

