Return-Path:
X-Original-To: apmail-cloudstack-issues-archive@www.apache.org
Delivered-To: apmail-cloudstack-issues-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
    by minotaur.apache.org (Postfix) with SMTP id 39F4D11F87
    for ; Mon, 25 Aug 2014 08:11:58 +0000 (UTC)
Received: (qmail 12384 invoked by uid 500); 25 Aug 2014 08:11:58 -0000
Delivered-To: apmail-cloudstack-issues-archive@cloudstack.apache.org
Received: (qmail 12354 invoked by uid 500); 25 Aug 2014 08:11:58 -0000
Mailing-List: contact issues-help@cloudstack.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@cloudstack.apache.org
Delivered-To: mailing list issues@cloudstack.apache.org
Received: (qmail 12344 invoked by uid 500); 25 Aug 2014 08:11:58 -0000
Delivered-To: apmail-incubator-cloudstack-issues@incubator.apache.org
Received: (qmail 12341 invoked by uid 99); 25 Aug 2014 08:11:58 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28)
    by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 25 Aug 2014 08:11:58 +0000
Date: Mon, 25 Aug 2014 08:11:58 +0000 (UTC)
From: "ASF subversion and git services (JIRA)"
To: cloudstack-issues@incubator.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Commented] (CLOUDSTACK-7415) Host remains in Alert after vCenter restart
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-7415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108874#comment-14108874 ]

ASF subversion and git services commented on CLOUDSTACK-7415:
-------------------------------------------------------------

Commit 8ce6eba549bcd3fa007aaf10a29c3a2fef9ffaaa in cloudstack's branch refs/heads/master from [~likithas]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=8ce6eba ]

CLOUDSTACK-7415.
Host remains in Alert after vCenter restart. Management server PingTask should update the PingMap entry for an agent only if it is already present in the Management Server's PingMap.

> Host remains in Alert after vCenter restart
> -------------------------------------------
>
> Key: CLOUDSTACK-7415
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7415
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public (Anyone can view this level - this is the default.)
> Components: Management Server
> Affects Versions: 4.0.0
> Reporter: Likitha Shetty
> Assignee: Likitha Shetty
> Priority: Critical
> Fix For: 4.5.0
>
>
> In a clustered management server environment, after a vCenter restart some hosts repeatedly go back into Alert state even after the vCenter comes up.
> The root cause is the following race condition:
> There is a scheduled PingTask that runs for every host, at a configurable interval (global config - ping.interval). When vCenter is restarted, PingTask is unable to get the host status, so it schedules another task to handle the disconnect for the host agent.
> This disconnect task determines the host status by sending CheckHealthCommand to the agent. When the command returns an answer saying the resource is not alive, CS investigates further, and in this case the VMware investigator confirms that the host is in disconnected state. The disconnect is then processed, which involves the following:
> 1. Cancel all scheduled tasks for that agent, which includes PingTask
> 2. Send disconnect to all listeners, including AgentMonitor, which clears the agent from the MS's PingMap
> If the above disconnect takes a while to get scheduled and spills over into the next PingTask interval, the next PingTask still runs. If by then the vCenter is up and the host is connected, the ping succeeds and an entry for the agent is made in the PingMap.
> Once an entry is made in the PingMap after a disconnect, the AgentMonitor task, which runs every minute, finds the agent behind on ping, disconnects the host agent without investigation because the attache is no longer connected, and puts the host back into Alert state.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
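
The race described in the issue, and the guard named in the commit message, can be sketched roughly as follows. This is a minimal illustration only, not the code from the commit: the class PingMapSketch and all of its method names are hypothetical, and the real PingMap lives inside the management server's agent manager. The sketch assumes the PingMap behaves like a concurrent map from agent id to last ping time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the management server's PingMap (not CloudStack's class).
class PingMapSketch {
    // agent id -> time of the last successful ping, in seconds
    private final Map<Long, Long> lastPing = new ConcurrentHashMap<>();

    // Agent connects: an entry is created.
    void register(long agentId, long now) {
        lastPing.put(agentId, now);
    }

    // Disconnect is processed: the entry is cleared.
    void disconnect(long agentId) {
        lastPing.remove(agentId);
    }

    // Behaviour described in the report: a late ping always writes an entry,
    // so it can re-create one for an agent that was already disconnected.
    void pingUnconditionally(long agentId, long now) {
        lastPing.put(agentId, now);
    }

    // Behaviour described in the commit message: refresh the entry only if
    // it is already present. Returns true if an entry was refreshed.
    boolean pingIfPresent(long agentId, long now) {
        return lastPing.computeIfPresent(agentId, (id, old) -> now) != null;
    }

    boolean isTracked(long agentId) {
        return lastPing.containsKey(agentId);
    }

    public static void main(String[] args) {
        PingMapSketch unguarded = new PingMapSketch();
        unguarded.register(1L, 0L);
        unguarded.disconnect(1L);               // disconnect clears the entry
        unguarded.pingUnconditionally(1L, 60L); // late PingTask re-adds it
        System.out.println("unguarded, tracked after late ping: " + unguarded.isTracked(1L));

        PingMapSketch guarded = new PingMapSketch();
        guarded.register(1L, 0L);
        guarded.disconnect(1L);
        guarded.pingIfPresent(1L, 60L);         // late PingTask is ignored
        System.out.println("guarded, tracked after late ping: " + guarded.isTracked(1L));
    }
}
```

In the unguarded case the stale entry is what the AgentMonitor later finds "behind on ping"; with the present-only update there is no entry for it to act on. computeIfPresent is used here so that the check and the update happen as one atomic step on the map.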