Date: Wed, 11 May 2016 03:50:13 +0000 (UTC)
From: "ASF subversion and git services (JIRA)"
To: cloudstack-issues@incubator.apache.org
Reply-To: dev@cloudstack.apache.org
Subject: [jira] [Commented] (CLOUDSTACK-9350) Local storage hosts get HA tasks, cause issues

    [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-9350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279453#comment-15279453 ]

ASF subversion and git services commented on CLOUDSTACK-9350:
-------------------------------------------------------------

Commit fa3bce5a83bc17f82fd9dd4dbc1b6502e64ef799 in cloudstack's branch refs/heads/master from [~williamstevens@gmail.com]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=fa3bce5 ]

Merge pull request #1496 from shapeblue/kvm-ha

CLOUDSTACK-9350: KVM-HA- Fix CheckOnHost for Local storage- KVM-HA- Fix CheckOnHost for Local storage
    - Also skip HA on VMs that are using local storage

* pr/1496:
  CLOUDSTACK-9350: KVM-HA- Fix CheckOnHost for Local storage
    - Also skip HA on VMs that are using local storage

Signed-off-by: Will Stevens


> Local storage hosts get HA tasks, cause issues
> -----------------------------------------------
>
>                 Key: CLOUDSTACK-9350
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-9350
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.)
>   Affects Versions: 4.5.1
>            Reporter: Abhinandan Prateek
>            Assignee: Abhinandan Prateek
>
> When a host hits its ping time out, for whatever reason, the investigators are triggered. The KVMInvestigator sends a CheckOnHostCommand to the target host, and then to all the remaining neighbor hosts in the cluster. The CheckOnHostCommand (and also FenceCommand, the code is nearly identical) is processed by the KVM agent and simply scans through all NFS primary storage looking for the host's heartbeat in the KVMHA directory. If no heartbeat file is found, it fails the check. In the case of clusters that are local-only, these hosts will always fail the check, whether it be the target host or a neighbor checking on the target. This triggers a host 'down' event, which triggers HA tasks.
> The HA tasks will attempt to stop any VMs on the host and then, if a VM's offering is HA-enabled, try to restart that VM.
>
> Our recent issue was that a management server took extraordinarily long to rotate its logs and was slow to process some host pings. The CheckOnHostCommand was sent to a suspect host, which failed because it had no NFS primary storage. The neighbor checks also failed to find the suspect host's heartbeat for the same reason. Then the host was marked as down and all VMs were stopped. Multiply this by a few dozen hosts.
>
> The immediate fix, provided in the example, is a patch to KVMInvestigator which will only attempt investigation if the host's cluster has NFS storage, which is a requirement for the host to run the check, as described above. If there is none, the host state is determined to be disconnected rather than down. This means that the host will still end up in alert state and need manual investigation, but there will be no attempt to stop or HA the VMs.
>
> Additionally, the patch catches scenarios where a cluster might have both NFS and local storage and a host ends up in 'down' state. In this case, when the HA tasks are being created, HA task generation is skipped for any VM that is using local storage, because such a VM can't be started anywhere else.
>
> We could also make the agent side more robust: in KVMHAChecker we may not want it to return 'false' if there were zero pools passed to check for HA heartbeat. Then again, maybe we do. We decided initially to patch just the server side, because it is easier to deploy.
>
> In the long run, I'd hope that the current HA work would supersede the current KVMInvestigator and take the cluster's ability to pass any defined checks into account before checking.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
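
The decision logic described in the quoted issue can be illustrated with a small model. This is a minimal sketch in Python, not the CloudStack Java code from the patch: the names `Status`, `Pool`, `Vm`, `investigate`, and `vms_needing_ha_task`, as well as the "nfs"/"local" labels, are hypothetical and chosen only for illustration. It assumes the heartbeat check is meaningful only when the cluster has at least one NFS primary storage pool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Status(Enum):
    # Hypothetical status names modelled on the states named in the issue.
    UP = "Up"
    DOWN = "Down"
    DISCONNECTED = "Disconnected"


@dataclass
class Pool:
    name: str
    kind: str  # "nfs" or "local" (illustrative labels, not CloudStack types)


@dataclass
class Vm:
    name: str
    uses_local_storage: bool


def investigate(cluster_pools: List[Pool],
                check_heartbeat: Callable[[], bool]) -> Status:
    """Decide a host's status after it misses its ping timeout."""
    has_nfs = any(pool.kind == "nfs" for pool in cluster_pools)
    if not has_nfs:
        # Without NFS primary storage there is no heartbeat file to look
        # for, so the check is skipped and the host is reported as
        # disconnected rather than down. No HA tasks follow from this.
        return Status.DISCONNECTED
    # With NFS present, a missing heartbeat means the host is down.
    return Status.UP if check_heartbeat() else Status.DOWN


def vms_needing_ha_task(vms: List[Vm]) -> List[str]:
    """Return the names of VMs for which an HA task would be generated.

    VMs on local storage are skipped: they cannot be started elsewhere.
    """
    return [vm.name for vm in vms if not vm.uses_local_storage]
```

In this model a local-only cluster never reaches the 'down' branch, which is the behaviour the issue asks for, and a mixed cluster with a host that is down still leaves local-storage VMs untouched.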