Date: Fri, 24 Jul 2015 06:47:05 +0000 (UTC)
From: "ASF subversion and git services (JIRA)"
To: cloudstack-issues@incubator.apache.org
Reply-To: dev@cloudstack.apache.org
Subject: [jira] [Commented] (CLOUDSTACK-8666) Put host in Alert state only after alert.wait timeout

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640020#comment-14640020 ]

ASF subversion and git services commented on CLOUDSTACK-8666:
-------------------------------------------------------------

Commit 090db05821a100ead24dee90658d5b0a863a8682 in cloudstack's branch refs/heads/master from [~koushikd]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=090db05 ]
CLOUDSTACK-8666: Put host in Alert state only after alert.wait timeout

Instead of putting the host to Alert state immediately, the investigators should be allowed to run for some time based on alert.wait global config. At the end of this interval if the host state still cannot be determined then put the host in Alert.
Also updated some of the log messages.

This closes #621

> Put host in Alert state only after alert.wait timeout
> -----------------------------------------------------
>
>                 Key: CLOUDSTACK-8666
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8666
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public (Anyone can view this level - this is the default.)
>          Components: Management Server
>   Affects Versions: 4.5.0, 4.6.0
>            Reporter: Koushik Das
>            Assignee: Koushik Das
>             Fix For: 4.6.0
>
>
> When there is a ping timeout on a host, investigators try to determine the state of the host. If none of the investigators is able to determine the host state, the process is repeated after some time. This works most of the time, except in some boundary scenarios. For example, if the last host or all hosts in an XS cluster are brought down, the investigators are not able to determine the host state and the investigation process never completes. In such scenarios the host state always remains Up.
> In order to fix these boundary scenarios, a fix was made (refer to commit 4a13f81485c0f0664c60acafe534946e7428f080) to immediately put the host in Alert state if the investigators are not able to determine the state after the ping timeout.
> The fix solved the boundary scenarios but introduced a new issue. Suppose there is an XS cluster with 2 hosts and the master host is brought down. In this case XS elects a new master for the cluster. Since the master is down, the investigators won't be able to determine the host state until a new master is elected. If this master election takes longer than the ping timeout to complete, the host is put in Alert based on the above fix.
> Once this happens, the host remains in Alert state and no actions are taken on the VMs on this host. In this case, if the investigators had been allowed to run 1 or 2 more times, the new master election would possibly have completed and the host state would have been determined correctly.
> In order to fix both these issues, instead of putting the host in Alert state immediately, the investigators should be allowed to run for some time, based on the alert.wait global config. If the host state still cannot be determined at the end of this interval, put the host in Alert.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
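
The decision rule described in this issue can be sketched in a few lines. The following is an illustrative Python model only, not the CloudStack implementation (which is Java): the names `HostState`, `investigate`, `next_state` and `make_election_probe` are hypothetical, and measuring the alert.wait interval from the ping timeout is an assumption, since the message does not say where the interval starts.

```python
from enum import Enum
from typing import Callable, Iterable, Optional


class HostState(Enum):
    UP = "Up"
    DOWN = "Down"
    ALERT = "Alert"


# Hypothetical investigator type: returns a HostState when it can decide,
# or None when it cannot determine the host state.
Investigator = Callable[[], Optional[HostState]]


def investigate(investigators: Iterable[Investigator]) -> Optional[HostState]:
    """Return the first definite answer, or None if no investigator can decide."""
    for probe in investigators:
        state = probe()
        if state is not None:
            return state
    return None


def next_state(
    investigators: Iterable[Investigator],
    current_state: HostState,
    seconds_since_ping_timeout: float,  # assumption: interval starts at ping timeout
    alert_wait_seconds: float,          # stands in for the alert.wait global config
) -> HostState:
    """One investigation round for a host that has missed its ping."""
    determined = investigate(investigators)
    if determined is not None:
        return determined
    if seconds_since_ping_timeout >= alert_wait_seconds:
        # Still undetermined after the whole interval: give up and raise Alert.
        return HostState.ALERT
    # Undetermined, but the interval has not elapsed: keep the current state
    # so that the investigators get another round later.
    return current_state


def make_election_probe(rounds_until_elected: int) -> Investigator:
    """Hypothetical probe that cannot answer until a new master is elected."""
    calls = {"n": 0}

    def probe() -> Optional[HostState]:
        calls["n"] += 1
        return HostState.DOWN if calls["n"] > rounds_until_elected else None

    return probe
```

With an arbitrary `alert_wait_seconds` of 1800, a probe built by `make_election_probe(2)` leaves the host in its current state for two rounds and reports Down on the third, which models the 2-host XS cluster case above. A set of investigators that never answers moves the host to Alert only once the interval has elapsed, which covers the case where all hosts are down.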