cloudstack-users mailing list archives

From David Ortiz <dpor...@outlook.com>
Subject RE: outage feedback and questions
Date Fri, 19 Jul 2013 18:41:14 GMT
Dean,
     The system I've been working with is a small dev system, so we only have the one appliance,
which is most definitely a single point of failure.  It looks like Nexenta does have an HA
plugin available for clustering two or more of them, but I don't really know anything about
it other than having just read the intro paragraph on this data sheet:
http://info.nexenta.com/rs/nexenta/images/data_sheet_ha_cluster.pdf

Thanks,
Dave


> From: dean.kamali@gmail.com
> Date: Fri, 19 Jul 2013 12:48:23 -0400
> Subject: Re: outage feedback and questions
> To: users@cloudstack.apache.org
> 
> For primary storage, does NexentaStor provide you with HA?
> 
> 
> On Fri, Jul 19, 2013 at 12:09 PM, David Ortiz <dportiz@outlook.com> wrote:
> 
> > Dean,
> >     We didn't really have a recovery plan in place at the time.
> >  Fortunately for us, this was just before we went live for other users to
> > hit our system, so I was able to compare the mysql database entries for
> > volumes with the list of files that were still present on primary
> > storage.  From there I could figure out which VMs were
> > missing root disks and delete/rebuild them as needed, and then for data
> > volumes that were missing we were able to simply recreate them and go into
> > the instances to reformat and do any other configuration.  Fortunately we
> > had created all the VMs that went down, and I had created base templates
> > for each basic system type we were using (e.g. hadoop node, web server,
> > etc.), so recovery was pretty straightforward.
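> > For anyone in a similar spot, a minimal sketch of that comparison (the
> > "cloud" database name, the volumes table columns, and the /mnt/primary
> > mount path are assumptions and may differ per install):
> >
> > ---
> > # volumes CloudStack still believes exist (prompts for the cloud DB password)
> > mysql -u cloud -p cloud -e "select id, name, path, state from volumes where removed is null" > db-volumes.txt
> >
> > # files actually left on the primary storage mount
> > ls -1 /mnt/primary > on-disk.txt
> >
> > # volumes listed in db-volumes.txt but absent from on-disk.txt point at
> > # VMs whose root/data disks need to be rebuilt
> > ---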
> > We have now been taking snapshots of our VMs and vendor VMs so we can
> > restore from those if things get corrupted.  We are also using NexentaStor
> > for our shared storage, which I believe lets you snapshot the entire shared
> > filesystem as well.
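> > Since NexentaStor is ZFS-based, that filesystem-wide snapshot can be taken
> > from its shell along these lines (the pool/dataset name is just a
> > placeholder for whatever backs the primary storage share):
> >
> > ---
> > # recursive snapshot of the dataset exported to the hypervisors
> > zfs snapshot -r tank/cloudstack-primary@before-maintenance
> > # list snapshots and, if ever needed, roll back
> > zfs list -t snapshot
> > zfs rollback tank/cloudstack-primary@before-maintenance
> > ---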
> > Thanks,
> > Dave
> >
> > > Date: Mon, 15 Jul 2013 17:27:24 -0400
> > > Subject: RE: outage feedback and questions
> > > From: dean.kamali@gmail.com
> > > To: users@cloudstack.apache.org
> > >
> > > Just wondering if you had a recovery plan?
> > > Would you please share with us your experience.
> > >
> > > Thank you
> > > On Jul 15, 2013 4:47 PM, "David Ortiz" <dportiz@outlook.com> wrote:
> > >
> > > > Laurent,
> > > >     We too had some issues where we lost VMs after a switch went down.
> > > > We are also using gfs2 over iSCSI for our primary storage.  Once I got
> > > > the cluster back up, fsck found a lot of corruption on the gfs2 fs,
> > > > which resulted in probably 6 VMs out of the 25 we had needing to have
> > > > volumes rebuilt, or having to be rebuilt completely.  I would guess
> > > > this is what happened in your case as well.
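> > > > That repair pass on a GFS2 volume looks roughly like the following
> > > > (the device path is a placeholder; the filesystem must be unmounted on
> > > > every cluster node first):
> > > >
> > > > ---
> > > > # run from a single node once the LUN is reachable again
> > > > fsck.gfs2 -y /dev/mapper/primary-gfs2-lun
> > > > ---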
> > > > Thanks,
> > > > David Ortiz
> > > >
> > > > > From: dean.kamali@gmail.com
> > > > > Date: Tue, 9 Jul 2013 19:35:52 -0400
> > > > > Subject: Re: outage feedback and questions
> > > > > To: users@cloudstack.apache.org
> > > > >
> > > > > Courtesy to geoff.higginbottom@shapeblue.com for answering this question first
> > > > >
> > > > >
> > > > > > On Tue, Jul 9, 2013 at 7:33 PM, Dean Kamali <dean.kamali@gmail.com> wrote:
> > > > >
> > > > > > Well, I asked on the mailing list some time ago about CloudStack's
> > > > > > behaviour when I lose connectivity to primary storage and the
> > > > > > hypervisors start rebooting randomly.
> > > > > >
> > > > > > I believe this is very similar to what happened in your case.
> > > > > >
> > > > > > This is actually 'by design'.  The logic is that if the storage goes
> > > > > > offline, then all VMs must have also failed, and a 'forced' reboot of
> > > > > > the Host 'might' automatically fix things.
> > > > > >
> > > > > > This is great if you only have one Primary Storage, but typically you
> > > > > > have more than one, so whilst the reboot might fix the failed storage,
> > > > > > it will also kill off all the perfectly good VMs which were still
> > > > > > happily running.
> > > > > >
> > > > > > The answer I got was for XenServer, not KVM; it involved removing the
> > > > > > reboot -f option from a config file.
> > > > > >
> > > > > >
> > > > > >
> > > > > > The fix for XenServer Hosts is to:
> > > > > >
> > > > > > 1. Modify /opt/xensource/bin/xenheartbeat.sh on all your Hosts,
> > > > > > commenting out the two entries which have "reboot -f"
> > > > > >
> > > > > > 2. Identify the PID of the script  - pidof -x xenheartbeat.sh
> > > > > >
> > > > > > 3. Restart the Script  - kill <pid>
> > > > > >
> > > > > > 4. Force reconnect the Host from the UI; the script will then
> > > > > > re-launch on reconnect (see the command sketch below)
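> > > > > >
> > > > > > A rough shell sketch of steps 1-3 on a single host (the sed pattern
> > > > > > is an assumption; editing the file by hand works just as well, and
> > > > > > keep a backup):
> > > > > >
> > > > > > ---
> > > > > > # comment out the lines containing "reboot -f" (writes a .bak copy)
> > > > > > sed -i.bak '/reboot -f/s/^/#/' /opt/xensource/bin/xenheartbeat.sh
> > > > > >
> > > > > > # restart the heartbeat script; it re-launches when the host is
> > > > > > # force-reconnected from the CloudStack UI
> > > > > > kill $(pidof -x xenheartbeat.sh)
> > > > > > ---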
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Jul 9, 2013 at 7:08 PM, Laurent Steff <Laurent.Steff@inria.fr> wrote:
> > > > > >
> > > > > >> Hi Dean,
> > > > > >>
> > > > > >> And thanks for your answer.
> > > > > >>
> > > > > >> Yes, the network troubles led to issues with the main storage
> > > > > >> on the clusters (iSCSI).
> > > > > >>
> > > > > >> So is it a fact that if the main storage is lost on KVM, the VMs
> > > > > >> are stopped and their domains destroyed?
> > > > > >>
> > > > > >> It was a hypothesis, as I found traces in
> > > > > >>
> > > > > >> apache-cloudstack-4.0.2-src/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java
> > > > > >>
> > > > > >> which "kills -9 qemu processes" if the main storage is not found,
> > > > > >> but I was not sure when the function was called.
> > > > > >>
> > > > > >> It's in the function checkingMountPoint, which calls destroyVMs
> > > > > >> if the mount point is not found.
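> > > > > >>
> > > > > >> (This is not the actual KVMHABase.java code, just the equivalent
> > > > > >> logic as a shell sketch of what that HA check does when the primary
> > > > > >> storage mount disappears; the mount path is a placeholder.)
> > > > > >>
> > > > > >> ---
> > > > > >> if ! mountpoint -q /mnt/<primary-storage-uuid>; then
> > > > > >>     # the HA heartbeat treats the storage as gone and
> > > > > >>     # force-kills the qemu processes (kill -9)
> > > > > >>     pkill -9 -f qemu
> > > > > >> fi
> > > > > >> ---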
> > > > > >>
> > > > > >> Regards,
> > > > > >>
> > > > > >> ----- Original Message -----
> > > > > >> > From: "Dean Kamali" <dean.kamali@gmail.com>
> > > > > >> > To: users@cloudstack.apache.org
> > > > > >> > Sent: Monday, July 8, 2013 16:34:04
> > > > > >> > Subject: Re: outage feedback and questions
> > > > > >> >
> > > > > >> > Surviving VMs are on the same KVM/GFS2 cluster.
> > > > > >> > The SSVM is one of them. Messages on the console indicate it was
> > > > > >> > temporarily in read-only mode.
> > > > > >> >
> > > > > >> > Do you have an issue with storage?
> > > > > >> >
> > > > > >> > I wouldn't expect a switch failure to cause all of this; it
> > > > > >> > will cause loss of network connectivity, but it shouldn't cause
> > > > > >> > your VMs to go down.
> > > > > >> >
> > > > > >> > This behavior usually happens when you lose your primary storage.
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > On Mon, Jul 8, 2013 at 8:39 AM, Laurent Steff <Laurent.Steff@inria.fr> wrote:
> > > > > >> >
> > > > > >> > > Hello,
> > > > > >> > >
> > > > > >> > > CloudStack is used in our company as a core component of a
> > > > > >> > > "Continuous Integration" service.
> > > > > >> > >
> > > > > >> > > We are mainly happy with it, for a lot of reasons too long to
> > > > > >> > > describe. :)
> > > > > >> > >
> > > > > >> > > We recently encountered a major service outage on CloudStack,
> > > > > >> > > mainly linked to bad practices on our side, and the aim of this
> > > > > >> > > post is:
> > > > > >> > >
> > > > > >> > > - to ask questions about things we haven't understood yet
> > > > > >> > > - to gather some practical best practices we missed
> > > > > >> > > - if the problems detected are still present in CloudStack 4.x,
> > > > > >> > >   to help make CloudStack more robust with our feedback
> > > > > >> > >
> > > > > >> > > We know the 3.x version is not supported and plan to move to
> > > > > >> > > 4.x ASAP.
> > > > > >> > >
> > > > > >> > > It's quite a long mail, and it may be badly directed (dev
> > > > > >> > > mailing list? multiple bugs?).
> > > > > >> > >
> > > > > >> > > Any response is appreciated ;)
> > > > > >> > >
> > > > > >> > > Regards,
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > -------------------- long part ----------------------------------------
> > > > > >> > >
> > > > > >> > > Architecture :
> > > > > >> > > --------------
> > > > > >> > >
> > > > > >> > > Old, non-Apache CloudStack 3.0.2 release
> > > > > >> > > 1 Zone, 1 physical network, 1 pod
> > > > > >> > > 1 Virtual Router VM, 1 SSVM
> > > > > >> > > 4 CentOS 6.3 KVM clusters, primary storage GFS2 on iSCSI storage
> > > > > >> > > Management Server on a VMware virtual machine
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > Incidents :
> > > > > >> > > -----------
> > > > > >> > >
> > > > > >> > > Day 1: Management Server DoSed by internal synchronization
> > > > > >> > > scripts (LDAP to CloudStack)
> > > > > >> > > Day 3: DoS corrected, Management Server RAM and CPU upgraded,
> > > > > >> > > and rebooted (it had not been rebooted in more than a year).
> > > > > >> > > CloudStack is running again normally
> > > > > >> > > (VM creation/stop/start/console/...)
> > > > > >> > > Day 4: (weekend) Network outage on the core datacenter switch.
> > > > > >> > > Network unstable for 2 days.
> > > > > >> > >
> > > > > >> > > Symptoms :
> > > > > >> > > ----------
> > > > > >> > >
> > > > > >> > > Day 7: The network is operational but most of the VMs have been
> > > > > >> > > down (250 of 300) since Day 4.
> > > > > >> > > Libvirt configuration erased (/etc/libvirt.d/qemu/VMuid.xml).
> > > > > >> > >
> > > > > >> > > The VirtualRouter VM was one of them. Filesystem corruption
> > > > > >> > > prevented it from rebooting normally.
> > > > > >> > >
> > > > > >> > > Surviving VMs are on the same KVM/GFS2 cluster.
> > > > > >> > > The SSVM is one of them. Messages on the console indicate it was
> > > > > >> > > temporarily in read-only mode.
> > > > > >> > >
> > > > > >> > > Hard way to revival (actions):
> > > > > >> > > -----------------------------
> > > > > >> > >
> > > > > >> > > 1. VirtualRouter VM destroyed by an administrator, to let
> > > > > >> > > CloudStack recreate it from the template.
> > > > > >> > >
> > > > > >> > > BUT :)
> > > > > >> > >
> > > > > >> > > The SystemVM KVM template is not available. Its status in the
> > > > > >> > > GUI is "CONNECTION REFUSED".
> > > > > >> > > The URL it was downloaded from during install is no longer
> > > > > >> > > valid (an old and unavailable internal mirror server instead of
> > > > > >> > > http://download.cloud.com)
> > > > > >> > >
> > > > > >> > > => we are unable to restart the stopped VMs or create new ones
> > > > > >> > >
> > > > > >> > > 2. Manual download of the template on the Management Server,
> > > > > >> > > as in a fresh install
> > > > > >> > >
> > > > > >> > > ---
> > > > > >> > > /usr/lib64/cloud/agent/scripts/storage/secondary/cloud-install-sys-tmplt \
> > > > > >> > >   -m /mnt/secondary/ \
> > > > > >> > >   -u http://ourworkingmirror/repository/cloudstack-downloads/acton-systemvm-02062012.qcow2.bz2 \
> > > > > >> > >   -h kvm -F
> > > > > >> > > ---
> > > > > >> > >
> > > > > >> > > It's not sufficient. The mysql table template_host_ref does not
> > > > > >> > > change, even when changing the url in the mysql tables.
> > > > > >> > > We still have "CONNECTION REFUSED" as the template status in
> > > > > >> > > mysql and in the GUI.
> > > > > >> > >
> > > > > >> > > 3. After analysis, we needed to manually alter the mysql tables
> > > > > >> > > (the template_id of the systemVM KVM template was x):
> > > > > >> > >
> > > > > >> > > ---
> > > > > >> > > update template_host_ref set download_state='DOWNLOADED' where
> > > > > >> > > template_id=x;
> > > > > >> > > update template_host_ref set job_id='NULL' where template_id=x;
> > > > > >> > > <= may be useless
> > > > > >> > > ---
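> > > > > >> > >
> > > > > >> > > (To double-check the result before retrying from the GUI, a
> > > > > >> > > quick query along these lines should help; "cloud" as the
> > > > > >> > > database name is an assumption, and x is the real template id:)
> > > > > >> > >
> > > > > >> > > ---
> > > > > >> > > mysql -u cloud -p cloud -e "select * from template_host_ref where template_id=x \G"
> > > > > >> > > ---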
> > > > > >> > >
> > > > > >> > > 4. As in MySQL, the status in the GUI is now DOWNLOADED
> > > > > >> > >
> > > > > >> > > 5. On power-on of a stopped VM, CloudStack builds a new
> > > > > >> > > VirtualRouter VM and we can let users manually start their
> > > > > >> > > stopped VMs
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > Questions :
> > > > > >> > > -----------
> > > > > >> > >
> > > > > >> > > 1. What stopped and destroyed the libvirt domains of our VMs?
> > > > > >> > > There is some code that could do this, but I'm not sure.
> > > > > >> > >
> > > > > >> > > 2. Is it possible that CloudStack autonomously triggered the
> > > > > >> > > re-download of the systemVM template? Or does it require human
> > > > > >> > > interaction?
> > > > > >> > >
> > > > > >> > > 3. In 4.x, is the risk of a corrupted systemVM template, or one
> > > > > >> > > with a bad status, still present? Is there any warning beyond a
> > > > > >> > > simple "connection refused" that is not really visible as an
> > > > > >> > > alert?
> > > > > >> > >
> > > > > >> > > 4. Does CloudStack retry by default to restart VMs that should
> > > > > >> > > be up, or do we need to configure this?
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > -------------------- end of long part ----------------------------------------
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > --
> > > > > >> > > Laurent Steff
> > > > > >> > >
> > > > > >> > > DSI/SESI
> > > > > >> > > http://www.inria.fr/
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >> --
> > > > > >> Laurent Steff
> > > > > >>
> > > > > >> DSI/SESI
> > > > > >> INRIA
> > > > > >> Tel.: +33 1 39 63 50 81
> > > > > >> Mobile: +33 6 87 66 77 85
> > > > > >> http://www.inria.fr/
> > > > > >>
> > > > > >
> > > > > >
> > > >
> >
> >