From: Dean Kamali
Date: Mon, 8 Jul 2013 10:34:04 -0400
Subject: Re: outage feedback and questions
To: users@cloudstack.apache.org

> Survivor VMs are on the same KVM/GFS2 cluster.
> SSVM is one of them. Messages on the console indicate it was temporarily
> in read-only mode.

Do you have an issue with storage?

I wouldn't expect a switch failure to cause all of this. It would cause a
loss of network connectivity, but it shouldn't cause your VMs to go down.

This behavior usually happens when you lose your primary storage.

On Mon, Jul 8, 2013 at 8:39 AM, Laurent Steff wrote:

> Hello,
>
> CloudStack is used in our company as a core component of a "Continuous
> Integration" service.
>
> We are mainly happy with it, for a lot of reasons too long to describe. :)
>
> We recently encountered a major service outage on CloudStack, mainly linked
> to bad practices on our side. The aim of this post is to:
>
> - ask questions about things we don't understand yet
> - gather some practical best practices we missed
> - if the problems detected are still present in CloudStack 4.x, help
>   make CloudStack more robust with our feedback
>
> We know that the 3.x version is not supported and plan to move to 4.x
> ASAP.
>
> It's quite a long mail, and it may be badly directed (dev mailing list ?
> multiple bugs ?)
>
> Any response is appreciated ;)
>
> Regards,
>
>
> --------------------long part----------------------------------------
>
> Architecture :
> --------------
>
> Old, non-Apache CloudStack 3.0.2 release
> 1 Zone, 1 physical network, 1 pod
> 1 Virtual Router VM, 1 SSVM
> 4 CentOS 6.3 KVM clusters, primary storage GFS2 on iSCSI storage
> Management Server on a VMware virtual machine
>
>
> Incidents :
> -----------
>
> Day 1 : Management Server DoSed by internal synchronization scripts (LDAP
> to CloudStack)
> Day 3 : DoS corrected, Management Server RAM and CPU upgraded, and server
> rebooted (it had not been rebooted in more than a year). CloudStack
> is running normally again (VM creation/stop/start/console/...)
> Day 4 : (week-end) Network outage on the core datacenter switch. Network
> unstable for 2 days.
>
> Symptoms :
> ----------
>
> Day 7 : The network is operational, but most VMs (250 of 300) have been
> down since Day 4.
> Libvirt configuration (/etc/libvirt.d/qemu/VMuid.xml) erased.
>
> The VirtualRouter VM was one of them. Filesystem corruption prevented
> it from rebooting normally.
>
> Survivor VMs are on the same KVM/GFS2 cluster.
> SSVM is one of them. Messages on the console indicate it was temporarily
> in read-only mode.
>
> Hard way to revival (actions):
> ------------------------------
>
> 1. VirtualRouter VM destroyed by an administrator, to let CloudStack
> recreate it from the template.
>
> BUT :)
>
> The SystemVM KVM template is not available. Its status in the GUI is
> "CONNECTION REFUSED".
> The URL it was downloaded from during install is no longer valid (an old,
> unavailable internal mirror server instead of http://download.cloud.com).
>
> => we are unable to restart stopped VMs or create new ones
>
> 2. Manual download of the template on the Management Server, as in a
> fresh install
>
> ---
> /usr/lib64/cloud/agent/scripts/storage/secondary/cloud-install-sys-tmplt
> -m /mnt/secondary/ -u
> http://ourworkingmirror/repository/cloudstack-downloads/acton-systemvm-02062012.qcow2.bz2 -h kvm -F
> ---
>
> It's not sufficient. The MySQL table template_host_ref does not change.
> Even after changing the URL in the MySQL tables, we still have
> "CONNECTION REFUSED" as the template status in MySQL and in the GUI.
>
> 3. After analysis, we needed to alter the MySQL tables manually
> (template_id of the systemVM KVM template was x) :
>
> ---
> update template_host_ref set download_state='DOWNLOADED' where
> template_id=x;
> update template_host_ref set job_id='NULL' where template_id=x; <= may be
> useless
> ---
>
> 4. As in MySQL, the status in the GUI is now DOWNLOADED
>
> 5. On power-on of a stopped VM, CloudStack builds a new VirtualRouter VM,
> and we can let users manually start their stopped VMs
>
>
> Questions :
> -----------
>
> 1. What stopped and destroyed the libvirt domains of our VMs ? There is
> some part of the code that could do this, but I'm not sure.
>
> 2. Is it possible that CloudStack autonomously triggered the re-download
> of the systemVM template ? Or does it require human interaction ?
>
> 3. In 4.x, is the risk of a corrupted systemVM template, or one with a bad
> status, still present ? Is there any warning beyond a simple "connection
> refused", which is not really visible as an alert ?
>
> 4. Does CloudStack retry by default to restart VMs that should be up, or
> do we need configuration for this ?
>
>
> --------------------end of long part----------------------------------------
>
>
> --
> Laurent Steff
>
> DSI/SESI
> http://www.inria.fr/
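
A side note on step 3 of the quoted message, offered as a minimal sketch and not as a tested fix: in SQL, job_id='NULL' (quoted) stores the four-character string NULL, while job_id=NULL (unquoted) stores a real SQL NULL. The example below shows the difference. It makes several assumptions: it uses Python's built-in sqlite3 module instead of MySQL, the two-column table is a simplified stand-in for the real template_host_ref, and the ids and job values are placeholders. I have not checked how CloudStack 3.0.2 treats a job_id holding the text NULL.

```python
import sqlite3


def demo_null_vs_string():
    """Compare job_id='NULL' (a string) with job_id=NULL (SQL NULL)."""
    conn = sqlite3.connect(":memory:")
    # Simplified stand-in table: the real template_host_ref has more columns.
    conn.execute(
        "CREATE TABLE template_host_ref (template_id INTEGER, job_id TEXT)"
    )
    # Placeholder rows; the ids and job values are made up for illustration.
    conn.executemany(
        "INSERT INTO template_host_ref VALUES (?, ?)",
        [(1, "job-a"), (2, "job-b")],
    )
    # Quoted form, as in the quoted message: writes the literal text 'NULL'.
    conn.execute("UPDATE template_host_ref SET job_id='NULL' WHERE template_id=1")
    # Unquoted form: writes a real SQL NULL.
    conn.execute("UPDATE template_host_ref SET job_id=NULL WHERE template_id=2")
    rows = conn.execute(
        "SELECT template_id, job_id, job_id IS NULL "
        "FROM template_host_ref ORDER BY template_id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for template_id, job_id, is_null in demo_null_vs_string():
        print(template_id, repr(job_id), "IS NULL" if is_null else "NOT NULL")
```

Row 1 keeps a non-null text value and row 2 is a true NULL, so any query that filters on "job_id IS NULL" would only match the second form.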