cloudstack-users mailing list archives

From Laurent Steff <Laurent.St...@inria.fr>
Subject outage feedback and questions
Date Mon, 08 Jul 2013 12:39:05 GMT
Hello,

Cloudstack is used in our company as a core component of a "Continuous Integration"
Service.

We are mainly happy with it, for a lot of reasons too long to describe. :)

We recently encountered a major service outage on Cloudstack, mainly linked
to bad practices on our side. The aims of this post are to:

- ask questions about things we do not understand yet
- gather some practical best practices we missed
- help make Cloudstack more robust with our feedback, if the problems
we detected are still present in Cloudstack 4.x

We know that the 3.x version is not supported, and we plan to move to 4.x ASAP.

It's quite a long mail, and it may be misdirected (dev mailing list? multiple bug reports?)

Any response is appreciated ;)

Regards,


--------------------long part----------------------------------------

Architecture :
--------------

Old, non-Apache CloudStack 3.0.2 release
1 Zone, 1 physical network, 1 pod
1 Virtual Router VM, 1 SSVM
4 CentOS 6.3 KVM clusters, primary storage GFS2 on iSCSI storage
Management Server on a VMware virtual machine



Incidents :
-----------

Day 1 : Management Server DoSed by internal synchronization scripts (LDAP to Cloudstack)
Day 3 : DoS corrected, Management Server RAM and CPU upgraded, and server rebooted (it had
not been rebooted in more than a year). Cloudstack is running normally again
(VM creation/stop/start/console/...)
Day 4 : (weekend) Network outage on the core datacenter switch. Network unstable for 2 days.

Symptoms :
----------

Day 7 : The network is operational, but most VMs (250 of 300) have been down since Day 4.
Their libvirt configuration (/etc/libvirt.d/qemu/VMuid.xml) was erased.

The VirtualRouter VM was one of them. Corruption of its filesystem prevented it from rebooting normally.

The surviving VMs are all on the same KVM/GFS2 cluster.
The SSVM is one of them. Messages on its console indicate that it was temporarily in read-only mode.
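
One way to measure this kind of damage on a host is to compare the VM names expected there against the domain XML files actually present. The following is only a minimal sketch: it assumes the definitions are stored as <name>.xml in a single directory (the path quoted above, or wherever libvirt keeps them on your install), and both the function name and the instance names in the example are hypothetical.

```python
from pathlib import Path


def missing_domain_xml(expected_names, xml_dir):
    """Return the expected VM names that have no <name>.xml file in xml_dir."""
    # Path.stem strips the ".xml" suffix, leaving the domain name.
    present = {p.stem for p in Path(xml_dir).glob("*.xml")}
    return sorted(name for name in expected_names if name not in present)


if __name__ == "__main__":
    import tempfile

    # Self-contained demonstration in a temporary directory.
    with tempfile.TemporaryDirectory() as demo_dir:
        (Path(demo_dir) / "i-2-10-VM.xml").write_text("<domain/>")
        print(missing_domain_xml(["i-2-10-VM", "i-2-11-VM"], demo_dir))
```

The list of expected names would have to come from the CloudStack database or API. That part is left out here because it depends on the installation.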

Hard road to recovery (actions):
--------------------------------

1. VirtualRouter VM destroyed by an administrator, to let CloudStack recreate it from the template.

BUT :)

the SystemVM KVM template is not available. Its status in the GUI is "CONNECTION REFUSED".
The URL it was downloaded from during install is no longer valid (an old internal
mirror server, now unavailable, instead of http://download.cloud.com)

=> we are unable to restart stopped VMs or create new ones

2. Manual download of the template on the Management Server, as in a fresh install

---
/usr/lib64/cloud/agent/scripts/storage/secondary/cloud-install-sys-tmplt -m /mnt/secondary/ \
  -u http://ourworkingmirror/repository/cloudstack-downloads/acton-systemvm-02062012.qcow2.bz2 \
  -h kvm -F
---

This is not sufficient. The MySQL table template_host_ref does not change, even after
changing the url in the MySQL tables.
We still have "CONNECTION REFUSED" as the template status, both in MySQL and in the GUI.

3. After analysis, we had to alter the MySQL tables manually (the template_id of the
systemVM KVM template was x) :

---
update template_host_ref set download_state='DOWNLOADED' where template_id=x;
update template_host_ref set job_id='NULL' where template_id=x; <= may be useless
---

4. The status in the GUI is now DOWNLOADED, as in MySQL

5. When a stopped VM is powered on, Cloudstack builds a new VirtualRouter VM, and we can
let users manually start their stopped VMs
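
One detail of the statements in step 3 may be worth checking: job_id='NULL' (with quotes) stores the four-character string NULL, not an SQL NULL. The sketch below shows the difference, using Python's built-in sqlite3 as a stand-in for MySQL. The table is reduced to three columns, and the template id 3, the initial state and the job id are made-up values, so this illustrates the SQL semantics only and not the real CloudStack schema.

```python
import sqlite3


def demo_null_update(template_id=3):
    """Compare job_id='NULL' (a string) with job_id=NULL (a real SQL NULL)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, reduced schema: the real template_host_ref has more columns.
    conn.execute(
        "CREATE TABLE template_host_ref ("
        "template_id INTEGER, download_state TEXT, job_id TEXT)"
    )
    # Placeholder starting row standing in for the broken template entry.
    conn.execute(
        "INSERT INTO template_host_ref VALUES (?, 'DOWNLOAD_ERROR', 'job-1')",
        (template_id,),
    )
    count_null = "SELECT COUNT(*) FROM template_host_ref WHERE job_id IS NULL"

    # Quoted form, as in step 3: writes the literal text 'NULL'.
    conn.execute(
        "UPDATE template_host_ref SET job_id='NULL' WHERE template_id=?",
        (template_id,),
    )
    after_quoted = conn.execute(count_null).fetchone()[0]

    # Unquoted form: sets a real SQL NULL.
    conn.execute(
        "UPDATE template_host_ref SET download_state='DOWNLOADED', job_id=NULL "
        "WHERE template_id=?",
        (template_id,),
    )
    after_unquoted = conn.execute(count_null).fetchone()[0]
    conn.close()
    return after_quoted, after_unquoted


if __name__ == "__main__":
    print(demo_null_update())
```

The first number counts the rows with a NULL job_id after the quoted update, and the second counts them after the unquoted one. Whether CloudStack cares about the difference for this column is not something this sketch can tell.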


Questions :
-----------

1. What stopped and destroyed the libvirt domains of our VMs? There are some parts
of the code that could do this, but I'm not sure

2. Is it possible that Cloudstack autonomously triggered the re-download of the
systemVM template? Or does it require human interaction?

3. In 4.x, is the risk of a corrupted systemVM template, or one with a bad status,
still present? Is there any warning beyond a simple "connection refused", which is
not really visible as an alert?

4. Does Cloudstack retry by default to restart VMs that should be up, or do
we need to configure this?
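
Related to question 3: whatever the answer, a periodic probe of the URL registered for the template would at least reveal a dead mirror before the template is needed. The following is a minimal sketch using only the Python standard library. probe_url is a hypothetical helper name, and it assumes the mirror answers HTTP HEAD requests.

```python
import urllib.error
import urllib.request


def probe_url(url, timeout=5.0):
    """Return (True, status) if url answers a HEAD request, else (False, reason)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return True, str(response.status)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, DNS failure, timeout, ...
        return False, str(getattr(exc, "reason", exc))


if __name__ == "__main__":
    import sys

    # Usage: python probe.py <template-url>
    ok, detail = probe_url(sys.argv[1])
    print("OK" if ok else "FAILED", detail)
    sys.exit(0 if ok else 1)
```

Run from cron with the template URL as its argument, a non-zero exit status could feed whatever alerting is already in place.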


--------------------end of long part----------------------------------------


-- 
Laurent Steff

DSI/SESI
http://www.inria.fr/
