Date: Fri, 19 Jan 2018 09:54:33 +0000 (GMT)
From: Nux!
To: dev
Cc: Daan Hoogland, Nicolas Vazquez, Boris Stoyanov
Subject: Re: HA issues

Thanks Rohit,

I'll do more tests and try to figure it out. This thing is happening to me
consistently on this setup; I'll use another one with basic networking and
see if it yields different results.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Rohit Yadav"
> To: "dev", "Daan Hoogland", "Nicolas Vazquez", "Boris Stoyanov"
> Sent: Friday, 19 January, 2018 08:59:00
> Subject: Re: HA issues

> Hi Lucian,
>
> Thanks for sharing. I still could not reproduce the issue: in my case, the
> KVM host went to the "Down" state and the VMs were started on other hosts.
> Given that this may not be a generally reproducible issue, it could be
> marked Critical, but maybe not a blocker?
>
> Please open/update the JIRA ticket with the details. /cc @Daan Hoogland,
> @Nicolas Vazquez, @Boris Stoyanov and others
>
> - Rohit
>
> ________________________________
> From: Nux!
> Sent: Wednesday, January 17, 2018 10:32:00 PM
> To: dev
> Subject: Re: HA issues
>
> Hi Rohit,
>
> I've reinstalled and tested. Still no go with VM HA.
>
> What I did was kernel panic that particular HV ("echo c >
> /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> What happened next is that the HV got marked as "Alert", while the VM on it
> was marked as "Running" the whole time and was not migrated to another HV.
> Once the panicked HV booted back up, the VM rebooted and became available.
>
> I'm running CentOS 7 for both the management server and the HVs, with NFS
> primary and secondary storage. The VM has an HA-enabled service offering.
> The Host HA and OOBM configuration was not touched.
>
> Full log: http://tmp.nux.ro/W3s-management-server.log
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro
>
> rohit.yadav@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London WC2N 4HS, UK
> @shapeblue
>
> ----- Original Message -----
>> From: "Rohit Yadav"
>> To: "dev"
>> Sent: Wednesday, 17 January, 2018 12:13:33
>> Subject: Re: HA issues
>
>> I performed VM HA sanity checks and was not able to reproduce any
>> regression against two KVM CentOS 7 hosts in a cluster.
>>
>> Without the "Host HA" feature, I deployed a few HA-enabled VMs on KVM
>> host2 and killed it (powered it off). After a few minutes of CloudStack
>> trying to find out why the host (the KVM agent) timed out, CloudStack
>> kicked off the investigators, which eventually led the KVM fencers to do
>> their work; the VM HA job then kicked in to start those VMs on host1, and
>> KVM host2 was put into the "Down" state.
>>
>> - Rohit
>>
>> ________________________________
>>
>> rohit.yadav@shapeblue.com
>> www.shapeblue.com
>> 53 Chandos Place, Covent Garden, London WC2N 4HS, UK
>> @shapeblue
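(Side note for anyone repeating this test: a minimal sketch of the crash/HA
check, assuming a stock CentOS 7 install and the default management server
log location; the grep patterns are only a guess at the relevant log lines.)

    # On the hypervisor under test: force an immediate kernel panic,
    # which approximates a hard crash - the host stops responding at once.
    echo 1 > /proc/sys/kernel/sysrq    # make sure magic sysrq is enabled
    echo c > /proc/sysrq-trigger       # trigger the panic

    # On the management server: watch CloudStack react to the dead host.
    tail -f /var/log/cloudstack/management/management-server.log \
        | grep -iE 'investigat|fenc|highavailab'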
>> From: Rohit Yadav
>> Sent: Wednesday, January 17, 2018 2:39:19 PM
>> To: dev
>> Subject: Re: HA issues
>>
>> Hi Lucian,
>>
>> The "Host HA" feature is entirely different from VM HA; however, they may
>> work in tandem, so please stop using the terms interchangeably, as it may
>> lead the community to believe a regression has been introduced.
>>
>> The "Host HA" feature currently ships with a "Host HA" provider only for
>> KVM, and that provider is strictly tied to out-of-band management (IPMI
>> for fencing, i.e. power off, and recovery, i.e. reboot) and to NFS as
>> primary storage. (We also have a provider for the simulator, but that's
>> for coverage/testing purposes.)
>>
>> Therefore, "Host HA" for KVM (+NFS) currently works only when OOBM is
>> enabled. The framework allows interested parties to write their own HA
>> providers for a hypervisor, using a different strategy/mechanism for
>> fencing and recovery of hosts (including a non-IPMI-based OOBM plugin)
>> and a host/disk activity checker that is not NFS-based.
>>
>> The "Host HA" feature ships disabled by default and does not interfere
>> with VM HA. However, when it is enabled and configured correctly, there
>> is a known limitation: when it is unable to successfully perform recovery
>> or fencing tasks, it may not trigger VM HA. We can discuss how to handle
>> such cases (thoughts?). "Host HA" will try a couple of times to recover
>> the host and, failing that, will eventually trigger a host fencing task.
>> If it is unable to fence the host, it will keep attempting to fence it
>> indefinitely (the host will be stuck in the fencing state in the
>> cloud.ha_config table, for example) and alerts will be sent to the admin,
>> who can intervene manually in such situations (if you have email/SMTP
>> enabled, you should see alert emails).
>>
>> We can discuss how to improve this and work around the case you've hit;
>> thanks for sharing.
>>
>> - Rohit
>>
>> ________________________________
>> From: Nux!
>> Sent: Tuesday, January 16, 2018 10:42:35 PM
>> To: dev
>> Subject: Re: HA issues
>>
>> Ok, reinstalled and re-tested.
>>
>> What I've learned:
>>
>> - HA now only works if OOBM is configured; the old way of doing HA no
>> longer applies. This can be good and bad - not everyone has IPMI.
>>
>> - HA only works if IPMI is reachable. I pulled the cord on a HV and HA
>> failed to do its thing, leaving me with a HV down along with all the VMs
>> running on it. That's bad.
>> I've opened this ticket for it:
>> https://issues.apache.org/jira/browse/CLOUDSTACK-10234
>>
>> Let me know if you need any extra info or things to test.
>>
>> Regards,
>> Lucian
>>
>> --
>> Sent from the Delta quadrant using Borg technology!
>>
>> Nux!
>> www.nux.ro
>>
>> ----- Original Message -----
>>> From: "Nux!"
>>> To: "dev"
>>> Sent: Tuesday, 16 January, 2018 11:35:58
>>> Subject: Re: HA issues
>>
>>> I'll reinstall my setup and try again, just to be sure I'm working from
>>> a clean slate.
>>>
>>> --
>>> Sent from the Delta quadrant using Borg technology!
>>>
>>> Nux!
>>> www.nux.ro
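(A hedged sketch of checking the Host HA state Rohit describes above. The
cloud.ha_config table is named in his mail; the column names here are
assumptions - verify them with DESCRIBE against your schema.)

    # Run on the management server against the "cloud" database.
    # Shows each host's HA provider and current state (e.g. stuck fencing).
    mysql -u cloud -p cloud -e \
        "SELECT resource_id, provider, enabled, ha_state FROM ha_config;"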
>>>
>>> ----- Original Message -----
>>>> From: "Rohit Yadav"
>>>> To: "dev"
>>>> Sent: Tuesday, 16 January, 2018 11:29:51
>>>> Subject: Re: HA issues
>>>
>>>> Hi Lucian,
>>>>
>>>> If you're talking about the new Host HA feature (with KVM + NFS + IPMI),
>>>> please refer to the following docs:
>>>>
>>>> http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/latest/hosts.html#out-of-band-management
>>>>
>>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
>>>>
>>>> We'll need you to look at the logs and perhaps create a JIRA ticket with
>>>> the logs and details. If you saw an IPMI-based reboot, then Host HA did
>>>> indeed try to recover (i.e. reboot) the host; once Host HA has done its
>>>> work, it schedules HA for the VMs as soon as the recovery operation
>>>> succeeds (we have simulator- and KVM-based Marvin tests for such
>>>> scenarios).
>>>>
>>>> Can you see any attempt to schedule VM HA in the logs, or any failure?
>>>>
>>>> - Rohit
>>>>
>>>> ________________________________
>>>> From: Nux!
>>>> Sent: Tuesday, January 16, 2018 12:47:56 AM
>>>> To: dev
>>>> Subject: [4.11] HA issues
>>>>
>>>> Hi,
>>>>
>>>> I see there's a new HA engine for KVM with IPMI support, which is really
>>>> nice; however, it seems hit and miss.
>>>> I created an instance with an HA offering and kernel panicked one of the
>>>> hypervisors. After a while the server was rebooted, probably via IPMI,
>>>> but the instance never moved to a running hypervisor, and even after the
>>>> original hypervisor came back, it was still left in the Stopped state.
>>>> Are there any extra things I need to set up to get proper HA?
>>>>
>>>> Regards,
>>>> Lucian
>>>>
>>>> --
>>>> Sent from the Delta quadrant using Borg technology!
>>>>
>>>> Nux!
>>>> www.nux.ro
>>>>
>>>> rohit.yadav@shapeblue.com
>>>> www.shapeblue.com
>>>> 53 Chandos Place, Covent Garden, London WC2N 4HS, UK
>>>> @shapeblue
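(To answer the "extra things to set up" question above: following the docs
Rohit links, enabling OOBM and then Host HA per host looks roughly like the
sketch below. The API names come from those docs; the IPMI address and
credentials and $HOST_ID are placeholders, and the exact cloudmonkey grammar
varies between versions, so verify with tab completion before relying on it.)

    # 1. Configure and enable out-of-band management (IPMI) for the host.
    cloudmonkey configureOutOfBandManagement hostid=$HOST_ID \
        driver=ipmitool address=10.0.0.5 port=623 \
        username=ADMIN password=changeme
    cloudmonkey enableOutOfBandManagementForHost hostid=$HOST_ID

    # 2. Configure and enable Host HA for the same host; the provider name
    #    "kvmhaprovider" is an assumption - list providers to confirm.
    cloudmonkey configureHAForHost hostid=$HOST_ID provider=kvmhaprovider
    cloudmonkey enableHAForHost hostid=$HOST_ID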