cloudstack-issues mailing list archives

From "Joris van Lieshout (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CLOUDSTACK-7853) Hosts that are temporary Disconnected and get behind on ping (PingTimeout) turn up in permanent state Alert
Date Thu, 06 Nov 2014 11:52:46 GMT
Joris van Lieshout created CLOUDSTACK-7853:
----------------------------------------------

             Summary: Hosts that are temporary Disconnected and get behind on ping (PingTimeout)
turn up in permanent state Alert
                 Key: CLOUDSTACK-7853
                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7853
             Project: CloudStack
          Issue Type: Bug
      Security Level: Public (Anyone can view this level - this is the default.)
    Affects Versions: Future, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
            Reporter: Joris van Lieshout
            Priority: Critical


If a host that is Disconnected gets behind on ping (PingTimeout), it is transitioned to a
permanent state of Alert. I have been unable to determine why the host gets behind on ping,
but my suspicion is that the management server is busy processing other agent requests and/or
xapi is temporarily unavailable.

INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-9551e174) Found the following agents
behind on ping: [421, 427, 425]
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Ping timeout for host 421, do invstigation
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Transition:[Resource state = Enabled, Agent
event = PingTimeout, Host id = 421, name = xxxxxx1]
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Agent status update: [id = 421; name =
xxxxxx1; old status = Disconnected; event = PingTimeout; new status = Alert; old update count
= 111; new update count = 112]

----/ next cycle / -----

INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Found the following agents
behind on ping: [421, 427, 425]
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Ping timeout for host 421, do invstigation
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Transition:[Resource state = Enabled, Agent
event = PingTimeout, Host id = 421, name = xxxxxx1]
DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Cannot transit agent status with event
PingTimeout for host 421, name=xxxxxx1, mangement server id is 345052370017
ERROR [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Caught the following exception:

com.cloud.utils.exception.CloudRuntimeException: Cannot transit agent status with event PingTimeout
for host 421, mangement server id is 345052370017,Unable to transition to a new state from
Alert via PingTimeout
        at com.cloud.agent.manager.AgentManagerImpl.agentStatusTransitTo(AgentManagerImpl.java:1334)
        at com.cloud.agent.manager.AgentManagerImpl.disconnectAgent(AgentManagerImpl.java:1349)
        at com.cloud.agent.manager.AgentManagerImpl.disconnectInternal(AgentManagerImpl.java:1378)
        at com.cloud.agent.manager.AgentManagerImpl.disconnectWithInvestigation(AgentManagerImpl.java:1384)
        at com.cloud.agent.manager.AgentManagerImpl$MonitorTask.runInContext(AgentManagerImpl.java:1466)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:701)

I think the bug occurs because there is no valid state transition from Alert via PingTimeout
to something recoverable. The transitions defined for Alert are:

Status.java
        s_fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
        s_fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
        s_fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
        s_fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
        s_fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
        s_fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
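The effect of the missing transition can be reproduced with a minimal, self-contained sketch.
This is not the CloudStack state machine: the class HostStatusFsmSketch, its addTransition and
next helpers, and the trimmed enums are hypothetical stand-ins. The Alert transitions are copied
from Status.java above, and the Disconnected -> Alert transition on PingTimeout is taken from
the "Agent status update" log line. The first PingTimeout moves the host to Alert; the second
finds no entry in the table and fails, much like the exception in the trace.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical stand-in for the host status state machine, for illustration only.
public class HostStatusFsmSketch {
    enum Status { Up, Connecting, Disconnected, Alert, Removed }

    enum Event {
        AgentConnected, Ping, Remove, ManagementServerDown,
        AgentDisconnected, ShutdownRequested, PingTimeout
    }

    // Transition table: current status -> (event -> next status).
    private static final Map<Status, Map<Event, Status>> TRANSITIONS =
            new EnumMap<Status, Map<Event, Status>>(Status.class);

    static void addTransition(Status from, Event event, Status to) {
        TRANSITIONS
                .computeIfAbsent(from, k -> new EnumMap<Event, Status>(Event.class))
                .put(event, to);
    }

    static {
        // Observed in the log: Disconnected + PingTimeout -> Alert.
        addTransition(Status.Disconnected, Event.PingTimeout, Status.Alert);

        // Transitions out of Alert, as listed in Status.java in this report.
        addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
        addTransition(Status.Alert, Event.Ping, Status.Up);
        addTransition(Status.Alert, Event.Remove, Status.Removed);
        addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
        addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
        addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
        // Note: no entry for (Alert, PingTimeout).
    }

    // Returns the next status, or throws if the table has no matching transition.
    static Status next(Status current, Event event) {
        Map<Event, Status> byEvent = TRANSITIONS.get(current);
        Status target = (byEvent == null) ? null : byEvent.get(event);
        if (target == null) {
            throw new IllegalStateException(
                    "Unable to transition to a new state from " + current + " via " + event);
        }
        return target;
    }

    public static void main(String[] args) {
        // First monitor cycle: the Disconnected host gets behind on ping.
        Status status = next(Status.Disconnected, Event.PingTimeout);
        System.out.println("After first PingTimeout: " + status);

        // Next cycle: the host is still behind on ping, but is now in Alert.
        try {
            next(status, Event.PingTimeout);
        } catch (IllegalStateException e) {
            System.out.println("Second PingTimeout failed: " + e.getMessage());
        }
    }
}
```

In this sketch, a Ping or AgentConnected event would still move the host out of Alert, so the
state is only permanent for as long as the monitor keeps raising PingTimeout and nothing else
arrives for that host.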

As a workaround to get out of this situation, we put the cluster in Unmanage, wait 10 minutes,
and then put the cluster back in Manage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
