cloudstack-dev mailing list archives

From Marty Sweet <msweet....@gmail.com>
Subject Re: Production Agent Disconnect
Date Sat, 17 Aug 2013 18:07:43 GMT
Following up on this, I just found the following errors on my management
server. This is very odd, as they are resolved within the same second. For
reference, ping.interval = 5 and ping.timeout (multiplier) = 2.
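
For reference, a minimal sketch of the arithmetic behind these two settings. It assumes ping.interval is in seconds and ping.timeout is a multiplier applied to it, giving the window after which an agent counts as behind on ping. The function names are illustrative only and are not CloudStack's own code.

```python
def ping_timeout_window(ping_interval_s, ping_timeout_multiplier):
    """Seconds without a ping before an agent counts as behind (assumed semantics)."""
    return ping_interval_s * ping_timeout_multiplier


def is_behind_on_ping(last_ping_s, now_s, window_s):
    # Hypothetical check: the agent is behind when its last ping is older than the window.
    return (now_s - last_ping_s) > window_s


window = ping_timeout_window(5, 2)
print(window)  # 10
```

Under that assumption the window here is 10 seconds, so roughly two consecutive missed pings would be enough to flag an agent.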

Thanks again,
Marty

Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentMonitor] (Thread-6:) Found the following agents behind on ping: [40, 27, 37, 38, 29, 39]
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-15:) Investigating why host 40 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-8:) Investigating why host 27 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-4:) Investigating why host 37 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-5:) Investigating why host 38 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-16:) Investigating why host 29 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-9:) Investigating why host 39 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-5:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-5:) Agent is determined to be up and running
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-4:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-4:) Agent is determined to be up and running
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-8:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-8:) Agent is determined to be up and running
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-15:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-15:) Agent is determined to be up and running
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-16:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-16:) Agent is determined to be up and running
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-9:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-9:) Agent is determined to be up and running
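
To check an excerpt like this without reading it by eye, here is a rough sketch that pairs each "Investigating" line with the "state determined" line from the same AgentTaskPool thread. It assumes each thread handles one host at a time within the excerpt, which holds for the lines above but is not guaranteed in general. Whitespace is normalized first so hard-wrapped entries still match. The `summarize` helper and the sample text are illustrative only.

```python
import re

# Matches either message kind, keyed by the task-pool thread that logged it.
PATTERN = re.compile(
    r"\((?P<thread>AgentTaskPool-\d+):\) "
    r"(?:Investigating why host (?P<host>\d+) has disconnected with event (?P<event>\w+)"
    r"|The state determined is (?P<state>\w+))"
)


def summarize(log_text):
    """Map host id -> (event, determined state), pairing lines by thread name."""
    flat = " ".join(log_text.split())  # undo hard wrapping
    pending = {}  # thread -> (host, event) still awaiting a verdict
    result = {}
    for m in PATTERN.finditer(flat):
        thread = m.group("thread")
        if m.group("host"):
            pending[thread] = (int(m.group("host")), m.group("event"))
        elif thread in pending:
            host, event = pending.pop(thread)
            result[host] = (event, m.group("state"))
    return result


sample = """Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO
 [agent.manager.AgentManagerImpl] (AgentTaskPool-15:) Investigating why
host 40 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO
 [agent.manager.AgentManagerImpl] (AgentTaskPool-15:) The state determined
is Up
"""
print(summarize(sample))  # {40: ('PingTimeout', 'Up')}
```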



On Sat, Aug 17, 2013 at 6:58 PM, Marty Sweet <msweet.dev@gmail.com> wrote:

> Hi Guys,
>
> I have just had a VM host randomly disconnect in production and
> subsequently take down some VMs.
> I have attached the logs (I happened to be running agent trace on this
> node). It would seem that the agent (or management?) waited 25 seconds
> before erroring, and then the CloudStack agent froze until 1800.
> I assume the agent syslog stack traces were caused by force closes of VMs;
> no other nodes were affected during this time period.
>
> While the host was in the disconnected state, I could connect to a VM that
> was running on that host, although CloudStack was already reporting that it
> was down.
> Would it be a good idea to ping VMs (their allocated IPs) before
> attempting to start them on other nodes, especially in an HA setup?
>
> If someone could look at the logs and let me know if there is something
> obvious, it would be most appreciated. I have also included the management
> bond, for reference, to show that the link didn't go down.
>
> Thanks in advance,
> Marty
>
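
To make the idea in the quoted message concrete, here is a minimal sketch of pinging a VM's allocated IPs before restarting it on another node. This is not how CloudStack's HA logic works. `safe_to_restart_elsewhere` and `icmp_probe` are hypothetical helpers, and the `ping -c 1 -W` flags assume a Linux iputils `ping`.

```python
import subprocess


def icmp_probe(ip, timeout_s=1):
    """Return True if a single ICMP echo to ip is answered (Linux ping flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def safe_to_restart_elsewhere(vm_ips, probe=icmp_probe):
    """Hypothetical guard: allow a restart on another node only if no IP answers.

    An empty IP list gives True, i.e. the check has nothing to veto.
    """
    return not any(probe(ip) for ip in vm_ips)


# Usage with a stand-in probe, so no network access is needed:
print(safe_to_restart_elsewhere(["192.0.2.10"], probe=lambda ip: True))   # False
print(safe_to_restart_elsewhere(["192.0.2.10"], probe=lambda ip: False))  # True
```

A reply only suggests the VM is still alive. A guest that filters ICMP would look down to this check, so it could at most serve as one extra signal.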
