cloudstack-issues mailing list archives

From "Remi Bergsma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CLOUDSTACK-7853) Hosts that are temporary Disconnected and get behind on ping (PingTimeout) turn up in permanent state Alert
Date Mon, 24 Aug 2015 17:09:45 GMT

     [ https://issues.apache.org/jira/browse/CLOUDSTACK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Remi Bergsma updated CLOUDSTACK-7853:
-------------------------------------
    Priority: Major  (was: Critical)

> Hosts that are temporary Disconnected and get behind on ping (PingTimeout) turn up in permanent state Alert
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-7853
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7853
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>    Affects Versions: 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
>            Reporter: Joris van Lieshout
>
> If for some reason a host that is Disconnected gets behind on ping (PingTimeout), it is
> transitioned to a permanent state of Alert. I have been unable to determine the cause, but
> I suspect that the management server is busy processing other agent requests and/or xapi
> is temporarily unavailable.
> INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-9551e174) Found the following agents behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Ping timeout for host 421, do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Agent status update: [id = 421; name = xxxxxx1; old status = Disconnected; event = PingTimeout; new status = Alert; old update count = 111; new update count = 112]
> ----/ next cycle / -----
> INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Found the following agents behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Ping timeout for host 421, do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Cannot transit agent status with event PingTimeout for host 421, name=xxxxxx1, mangement server id is 345052370017
> ERROR [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Caught the following exception:
> com.cloud.utils.exception.CloudRuntimeException: Cannot transit agent status with event PingTimeout for host 421, mangement server id is 345052370017,Unable to transition to a new state from Alert via PingTimeout
>         at com.cloud.agent.manager.AgentManagerImpl.agentStatusTransitTo(AgentManagerImpl.java:1334)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectAgent(AgentManagerImpl.java:1349)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectInternal(AgentManagerImpl.java:1378)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectWithInvestigation(AgentManagerImpl.java:1384)
>         at com.cloud.agent.manager.AgentManagerImpl$MonitorTask.runInContext(AgentManagerImpl.java:1466)
>         at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>         at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:701)
> I think the bug occurs because there is no valid state transition from Alert via PingTimeout
> to something recoverable. These are the transitions currently defined for Alert:
> Status.java
>         s_fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
>         s_fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
>         s_fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
>         s_fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
>         s_fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
>         s_fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
> As a workaround to get out of this situation, we unmanage the cluster, wait 10 minutes,
> and then manage the cluster again.
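
The missing transition described in the report can be reproduced with a small, self-contained model. This is only a sketch under stated assumptions: `HostStatusSketch`, its `addTransition` helper and its `next` lookup are hypothetical names and are not CloudStack's state machine classes. The Alert transitions are copied from the Status.java excerpt in the report, and the Disconnected to Alert step is taken from the "Agent status update" log line. Any other transitions CloudStack defines are left out.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified model of the host status transitions.
// Not CloudStack code: only the transitions quoted in the report are modeled.
public class HostStatusSketch {
    enum Status { Up, Connecting, Disconnected, Alert, Removed }

    enum Event {
        AgentConnected, Ping, Remove, ManagementServerDown,
        AgentDisconnected, ShutdownRequested, PingTimeout
    }

    private static final Map<Status, Map<Event, Status>> TRANSITIONS =
            new EnumMap<>(Status.class);

    static void addTransition(Status from, Event event, Status to) {
        TRANSITIONS
                .computeIfAbsent(from, s -> new EnumMap<Event, Status>(Event.class))
                .put(event, to);
    }

    static {
        // Observed in the log: old status = Disconnected, event = PingTimeout, new status = Alert.
        addTransition(Status.Disconnected, Event.PingTimeout, Status.Alert);

        // Copied from the Status.java excerpt in the report.
        addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
        addTransition(Status.Alert, Event.Ping, Status.Up);
        addTransition(Status.Alert, Event.Remove, Status.Removed);
        addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
        addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
        addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
    }

    // Returns the next status, or an empty Optional when no transition is defined.
    static Optional<Status> next(Status current, Event event) {
        Map<Event, Status> byEvent = TRANSITIONS.get(current);
        return byEvent == null ? Optional.empty() : Optional.ofNullable(byEvent.get(event));
    }

    public static void main(String[] args) {
        // First monitor cycle: the Disconnected host times out and moves to Alert.
        Status afterFirstCycle = next(Status.Disconnected, Event.PingTimeout)
                .orElseThrow(IllegalStateException::new);
        System.out.println("Disconnected + PingTimeout -> " + afterFirstCycle);

        // Next cycle: the same event arrives again, but Alert has no entry for it.
        System.out.println("Alert + PingTimeout defined? "
                + next(afterFirstCycle, Event.PingTimeout).isPresent());
    }
}
```

In this model the second lookup comes back empty, which corresponds to the "Unable to transition to a new state from Alert via PingTimeout" exception in the stack trace. The table does contain exits from Alert (for example via Ping or AgentConnected), so the host is only stuck for as long as none of those events arrive.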



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
