cloudstack-issues mailing list archives

From "Joris van Lieshout (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLOUDSTACK-7853) Hosts that are temporary Disconnected and get behind on ping (PingTimeout) turn up in permanent state Alert
Date Mon, 10 Nov 2014 13:12:33 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204744#comment-14204744 ]

Joris van Lieshout commented on CLOUDSTACK-7853:
------------------------------------------------

What I just saw in our management log is that, 3 minutes before the management server found
the host behind on ping, the cluster was put in Unmanage mode (XenServer patching maintenance).

I also noticed that the AgentTaskPool thread that would do the investigation you mention
was not triggered for this host. I don't know whether this is because it was busy or because
the agent thread was destroyed after the cluster was put in Unmanage.

This is how I now believe it went:
1. Cluster was put in Unmanage
2. Host rebooted (the brand of physical boxes we use needs at least 10 minutes to reboot)
3. Host got behind on ping in the meantime
4. Host state transitioned from Disconnected to Alert via PingTimeout
5. On the next AgentMonitor cycle a transition was attempted from Alert via PingTimeout. This
is an unknown transition, so an exception was thrown.
6. Host returned from reboot and the cluster was set to manage again
7. Because of this invalid state transition, the host never transitioned from Alert to
anything else.
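
To make steps 4 and 5 concrete, here is a minimal sketch of a transition table. It is not the
CloudStack code: the class HostFsmSketch and the helpers addTransition and next are hypothetical
names of mine, and only three transitions are registered (Disconnected via PingTimeout to Alert,
as seen in the log, plus two of the Alert transitions from Status.java). It only illustrates
why the second PingTimeout has nowhere to go.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified model of a host status state machine.
// Not the CloudStack implementation; names are illustrative only.
final class HostFsmSketch {
    enum Status { Disconnected, Alert, Up, Connecting }
    enum Event { PingTimeout, Ping, AgentConnected }

    // Transition table: current status -> (event -> next status).
    private static final Map<Status, Map<Event, Status>> TRANSITIONS =
            new EnumMap<Status, Map<Event, Status>>(Status.class);

    static void addTransition(Status from, Event event, Status to) {
        Map<Event, Status> row = TRANSITIONS.get(from);
        if (row == null) {
            row = new EnumMap<Event, Status>(Event.class);
            TRANSITIONS.put(from, row);
        }
        row.put(event, to);
    }

    static {
        // Observed in the log: Disconnected + PingTimeout -> Alert.
        addTransition(Status.Disconnected, Event.PingTimeout, Status.Alert);
        // Two of the Alert transitions listed in Status.java.
        addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
        addTransition(Status.Alert, Event.Ping, Status.Up);
        // Note: nothing is registered for Alert + PingTimeout.
    }

    // Returns the next status, or throws if the pair (current, event) is unknown.
    static Status next(Status current, Event event) {
        Map<Event, Status> row = TRANSITIONS.get(current);
        Status target = (row == null) ? null : row.get(event);
        if (target == null) {
            throw new IllegalStateException(
                    "Unable to transition to a new state from " + current + " via " + event);
        }
        return target;
    }

    public static void main(String[] args) {
        // First monitor cycle: the host moves to Alert.
        Status status = next(Status.Disconnected, Event.PingTimeout);
        System.out.println("After first PingTimeout: " + status);
        // Next monitor cycle: the same event is rejected because no transition exists.
        try {
            next(status, Event.PingTimeout);
        } catch (IllegalStateException e) {
            System.out.println("Second PingTimeout failed: " + e.getMessage());
        }
    }
}
```

In this sketch the host would only leave Alert through Ping or AgentConnected, which matches
what we see: as long as only PingTimeout is raised, the status stays at Alert.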

> Hosts that are temporary Disconnected and get behind on ping (PingTimeout) turn up in
permanent state Alert
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-7853
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7853
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>    Affects Versions: Future, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
>            Reporter: Joris van Lieshout
>            Priority: Critical
>
> If for some reason (I've been unable to determine why, but my suspicion is that the management
server is busy processing other agent requests and/or xapi is temporarily unavailable) a host
that is Disconnected gets behind on ping (PingTimeout), it is transitioned to a permanent state
of Alert.
> INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-9551e174) Found the following agents
behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Ping timeout for host 421, do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Transition:[Resource state = Enabled,
Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Agent status update: [id = 421; name
= xxxxxx1; old status = Disconnected; event = PingTimeout; new status = Alert; old update
count = 111; new update count = 112]
> ----/ next cycle / -----
> INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Found the following agents
behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Ping timeout for host 421, do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Transition:[Resource state = Enabled,
Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Cannot transit agent status with event
PingTimeout for host 421, name=xxxxxx1, mangement server id is 345052370017
> ERROR [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Caught the following exception:

> com.cloud.utils.exception.CloudRuntimeException: Cannot transit agent status with event
PingTimeout for host 421, mangement server id is 345052370017,Unable to transition to a new
state from Alert via PingTimeout
>         at com.cloud.agent.manager.AgentManagerImpl.agentStatusTransitTo(AgentManagerImpl.java:1334)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectAgent(AgentManagerImpl.java:1349)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectInternal(AgentManagerImpl.java:1378)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectWithInvestigation(AgentManagerImpl.java:1384)
>         at com.cloud.agent.manager.AgentManagerImpl$MonitorTask.runInContext(AgentManagerImpl.java:1466)
>         at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>         at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:701)
> I think the bug occurs because there is no valid state transition from Alert via PingTimeout
to something recoverable.
> Status.java
>         s_fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
>         s_fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
>         s_fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
>         s_fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
>         s_fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
>         s_fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
> As a workaround to get out of this situation, we put the cluster in Unmanage, wait 10
minutes, and put the cluster back in manage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
