From: Andrei Mikhailovsky
To: dev@cloudstack.apache.org
Cc: users@cloudstack.apache.org
Date: Tue, 4 Mar 2014 12:55:46 +0000 (GMT)
Subject: Re: ALARM - ACS reboots host servers!!!

I agree with France; that sounds like a more sensible idea than killing hosts left, right and centre along with their live VMs. I now understand the reasoning behind killing the troubled host server; however, this should be done without killing live VMs whose volumes are fully working.

Regarding having NFS and ceph storage in different clusters: that sounds like a good idea for the majority of cases, but my setup will not allow me to do that just yet. I am using ceph for my root and data volumes and NFS for backup volumes. I currently need the backup volumes because snapshotting with KVM is somewhat broken / not fully working in 4.2.1 (it has improved since 4.2.0, where it was completely broken). I am waiting for 4.3.0 where, hopefully, I will be able to keep snapshots on the primary storage (currently this feature is broken), which will make snapshots with KVM usable.
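Since putting the pool into maintenance before touching the NFS box keeps coming up in this thread, here is a rough, untested sketch of how that step could be scripted against the management server's HTTP API. The endpoint URL, API/secret keys and pool UUID are placeholders, the 'requests' package is assumed to be installed, and the signing follows the documented CloudStack signed-request scheme; enableStorageMaintenance is an async job, so you would want to wait for it before rebooting anything:

# Rough sketch only: put an NFS primary storage pool into maintenance via
# the CloudStack HTTP API before rebooting the NFS server. Endpoint, keys
# and pool id below are placeholders, not values from this thread.
import base64, hashlib, hmac, urllib.parse
import requests  # assumes the 'requests' package is available

ENDPOINT = "http://mgmt.example.com:8080/client/api"   # hypothetical
API_KEY  = "YOUR_API_KEY"
SECRET   = "YOUR_SECRET_KEY"

def call(command, **params):
    params.update({"command": command, "apiKey": API_KEY, "response": "json"})
    # CloudStack signing scheme: sort the parameters, URL-encode the values,
    # lowercase the whole string, HMAC-SHA1 it with the secret key, base64-encode.
    query = "&".join("%s=%s" % (k, urllib.parse.quote(str(v), safe=""))
                     for k, v in sorted(params.items()))
    digest = hmac.new(SECRET.encode(), query.lower().encode(), hashlib.sha1).digest()
    signature = urllib.parse.quote(base64.b64encode(digest).decode(), safe="")
    return requests.get("%s?%s&signature=%s" % (ENDPOINT, query, signature)).json()

# Pool id as reported by listStoragePools. This returns an async job id,
# so poll queryAsyncJobResult before actually rebooting/upgrading the NFS box.
print(call("enableStorageMaintenance", id="POOL-UUID-GOES-HERE"))
# ...and once the NFS server is back:
# print(call("cancelStorageMaintenance", id="POOL-UUID-GOES-HERE"))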
Cheers for your help guys

----- Original Message -----

From: "France"
To: users@cloudstack.apache.org, dev@cloudstack.apache.org
Sent: Tuesday, 4 March, 2014 10:34:36 AM
Subject: Re: ALARM - ACS reboots host servers!!!

Hi Marcus and others.

There is no need to kill off the entire hypervisor if one of the primary storages fails. You just need to kill the affected VMs and probably disable the SR on XenServer, because all the other SRs and VMs have no problems. Once you kill those, you can safely start them elsewhere. On XenServer 6.2 you can destroy the VMs which lost access to NFS without any problems.

If you really still want to kill the entire host and its VMs in one go, I would suggest first live migrating off the VMs which have not lost their storage, and then killing the VMs on the stale NFS with a hard reboot. The additional time spent migrating the working VMs would even give the NFS some grace time in which to recover. :-) A hard reboot to recover from the D state of the NFS client can also be avoided by using soft mount options.

I run a bunch of Pacemaker/Corosync/CMAN/Heartbeat/etc. clusters, and we don't just kill whole nodes; we fence services from specific nodes. STONITH is used only when a node loses quorum.

Regards,
F.

On 3/3/14 5:35 PM, Marcus wrote:
> It's the standard clustering problem. Any software that does any sort of active clustering is going to fence nodes that have problems, or should, if it cares about your data. If the risk of losing a host due to a storage pool outage is too great, you could perhaps look at rearranging your pool-to-host correlations (certain hosts run VMs from certain pools) via clusters. Note that if you register a storage pool with a cluster, it will register the pool with libvirt whenever the pool is not in maintenance, which will cause problems for the host when the storage pool goes down even if no VMs from that storage are running (fetching storage stats, for example, will cause agent threads to hang if it's NFS), so you'd need to put ceph in its own cluster and NFS in its own cluster.
>
> It's far more dangerous to leave a host in an unknown/bad state. If a host loses contact with one of your storage nodes, then with HA, CloudStack will want to start the affected VMs elsewhere. If it does so, and your original host wakes up from its NFS hang, you suddenly have a VM running in two locations and corruption ensues. You might think we could just stop the affected VMs, but NFS tends to make things that touch it go into D state, even with 'intr' and other parameters, which affects libvirt and the agent.
>
> We could perhaps open a feature request to disable all HA and just leave things as-is, disallowing operations when there are outages. If that sounds useful you can create the feature request on https://issues.apache.org/jira.
>
>
> On Mon, Mar 3, 2014 at 5:37 AM, Andrei Mikhailovsky wrote:
>> Koushik, I understand that, and I will put the storage into maintenance mode next time. However, things happen and servers crash from time to time, which is not a reason to reboot all host servers, even those which do not have any running VMs with volumes on the NFS storage. The bloody agent just rebooted every single host server, regardless of whether it was running VMs with volumes on the rebooted NFS server. 95% of my VMs run from ceph and those should never have been affected in the first place.
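[Inline note on Marcus's point above about pools being registered with libvirt, and France's point that only the affected guests need killing: the rough, untested sketch below uses the libvirt-python bindings on a KVM host to list which running guests actually have a disk on a given pool, e.g. the NFS pool that CloudStack registered, so only those guests and that one pool would need handling instead of fencing the whole host. The pool name is a placeholder, and as noted above, if the NFS mount is hard-hung even these libvirt calls can block in D state.]

# Rough sketch: which running guests have a volume on a given libvirt pool?
import xml.etree.ElementTree as ET
import libvirt  # libvirt-python bindings

POOL_NAME = "nfs-primary-1"   # hypothetical; use the pool name ACS registered

conn = libvirt.open("qemu:///system")
pool = conn.storagePoolLookupByName(POOL_NAME)
pool_path = ET.fromstring(pool.XMLDesc(0)).findtext("target/path")

affected = []
for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
    tree = ET.fromstring(dom.XMLDesc(0))
    sources = [s.get("file") or s.get("dev")
               for s in tree.findall("devices/disk/source")]
    if any(src and src.startswith(pool_path) for src in sources):
        affected.append(dom.name())

print("Guests with volumes on %s (%s): %s" % (POOL_NAME, pool_path, affected))
# Only these would need dom.destroy(); the others could be live migrated,
# and the stale pool itself deactivated with pool.destroy().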
>> ----- Original Message -----
>>
>> From: "Koushik Das"
>> To: ""
>> Cc: dev@cloudstack.apache.org
>> Sent: Monday, 3 March, 2014 5:55:34 AM
>> Subject: Re: ALARM - ACS reboots host servers!!!
>>
>> The primary storage needs to be put into maintenance before doing any upgrade/reboot, as mentioned in the previous mails.
>>
>> -Koushik
>>
>> On 03-Mar-2014, at 6:07 AM, Marcus wrote:
>>
>>> Also, please note that the bug you referenced doesn't describe a problem with the reboot being triggered, but with the fact that the reboot never completes due to the hanging NFS mount (which is why the reboot occurs in the first place: inaccessible primary storage).
>>>
>>> On Sun, Mar 2, 2014 at 5:26 PM, Marcus wrote:
>>>> Or do you mean you have multiple primary storages, and this one was not in use and was put into maintenance?
>>>>
>>>> On Sun, Mar 2, 2014 at 5:25 PM, Marcus wrote:
>>>>> I'm not sure I understand. How do you expect to reboot your primary storage while VMs are running? It sounds like the host is being fenced since it cannot contact the resources it depends on.
>>>>>
>>>>> On Sun, Mar 2, 2014 at 3:24 PM, Nux! wrote:
>>>>>> On 02.03.2014 21:17, Andrei Mikhailovsky wrote:
>>>>>>> Hello guys,
>>>>>>>
>>>>>>> I've recently come across the bug CLOUDSTACK-5429, which rebooted all of my host servers without properly shutting down the guest VMs. I simply upgraded and rebooted one of the NFS primary storage servers and, a few minutes later, to my horror, found out that all of my host servers had been rebooted. Is it just me, or should this bug be fixed ASAP and be a blocker for any new ACS release? Not only does it cause downtime, but also possible data loss and server corruption.
>>>>>>
>>>>>> Hi Andrei,
>>>>>>
>>>>>> Do you have HA enabled, and did you put that primary storage in maintenance mode before rebooting it?
>>>>>> It's my understanding that ACS relies on the shared storage to perform HA, so if the storage goes, it's expected to go berserk. I've noticed similar behaviour in XenServer pools without ACS.
>>>>>> I'd imagine a "cure" for this would be to use network distributed "filesystems" like GlusterFS or CEPH.
>>>>>>
>>>>>> Lucian
>>>>>>
>>>>>> --
>>>>>> Sent from the Delta quadrant using Borg technology!
>>>>>>
>>>>>> Nux!
>>>>>> www.nux.ro
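P.S. On Marcus's point above that the reboot never completes because of the hanging NFS mount: a quick, untested way to check whether a host is already in that situation is to look for processes stuck in uninterruptible sleep (state 'D'), which is what a hard-mounted, unreachable NFS export typically produces. A minimal Python sketch, standard library only; it simply reports the stuck processes, it does not try to recover them:

# Rough sketch: list processes in uninterruptible sleep (state 'D'), the
# usual symptom of a hard-hung NFS mount that keeps libvirt, the agent
# and eventually the reboot itself from completing.
import os

def d_state_processes():
    stuck = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/stat" % pid) as f:
                data = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain spaces,
        # so split around the parentheses rather than on whitespace.
        comm = data[data.index("(") + 1:data.rindex(")")]
        state = data[data.rindex(")") + 1:].split()[0]
        if state == "D":
            stuck.append((int(pid), comm))
    return stuck

if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)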