cloudstack-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Indra Pramana <in...@sg.or.id>
Subject Re: KVM host agent disconnection
Date Sat, 06 Feb 2016 19:16:30 GMT
Hi Wido and all,

Good day to you.

In addition to my previous email, I noted that the latest released version
of ACS is 4.7. May I know if the problem is resolved by 4.7? I don't
think 4.8 is already available from ACS repository, unless if we get the
source and compile ourselves.

https://cloudstack.apache.org/downloads.html

I also noted that the latest version of ACS 4.7 only supports Ubuntu 14.04,
we are using Ubuntu 12.04 for all our management server and KVM host
agents. Will the latest version of ACS 4.7 work on Ubuntu 12.04?

I found below documentation on how to upgrade from 4.2 to 4.7:

http://docs.cloudstack.apache.org/projects/cloudstack-release-notes/en/4.7.0/upgrade/upgrade-4.2.html

It seems to be quite straight-forward but I noticed that the upgrade
involves installing new system VM templates and restarting of all the
system VMs, which will cause downtime.

Anyone has performed upgrade from ACS version 4.2 to 4.7 before and able to
share your experience and give some advice and/or tips?

Thank you.


On Sat, Feb 6, 2016 at 8:01 PM, Indra Pramana <indra@sg.or.id> wrote:

> Hi Wido,
>
> Good day to you, and thanks for your reply. Nice to hear from you again. :)
>
> So is this caused by a bug on 4.2 which is resolved on newer version of
> ACS? Any specific information on the bug, e.g. bug ID and description on
> how to fix it? Is there a way I can resolve the problem without having to
> upgrade?
>
> Is there any documentation I can follow on how to upgrade from 4.2 to 4.8?
> Will this be quite straight-forward or will this involve many steps? We are
> running a production environment and we don't have staging / test
> environment to play with.
>
> Looking forward to your reply, thank you.
>
> Cheers.
>
> On Sat, Feb 6, 2016 at 3:48 PM, Wido den Hollander <wido@widodh.nl> wrote:
>
>> Hi,
>>
>> > Op 5 februari 2016 om 17:24 schreef Indra Pramana <indra@sg.or.id>:
>> >
>> >
>> > Dear all,
>> >
>> > We are using CloudStack 4.2.0, KVM hypervisor and Ceph RBD for primary
>> > storage. In the past one week, many of our KVM host agents would often
>> be
>> > disconnected from the management server, causing the VMs to go down
>> because
>> > of HA work. While we used to have host disconnection in the past,
>> normally
>> > it would only affect just one host, but this time round, when the
>> problem
>> > happens, it would happen on multiple hosts, up to 4-5 hosts at the same
>> > time.
>> >
>>
>> Any reason to still run 4.2? I've seen this happen as well and I haven't
>> seen
>> this with recent versions of ACS.
>>
>> Could you maybe upgrade to 4.8?
>>
>> Wido
>>
>> > Nothing much I can find on both the management-server.log and agent.log,
>> > with no significant warn, error or exceptions logged before the
>> > disconnection. Here are the sample logs from the agent:
>> >
>> > ===
>> > 2016-02-05 03:20:28,820 ERROR [cloud.agent.Agent] (UgentTask-7:null)
>> Ping
>> > Interval has gone past 300000.  Attempting to reconnect.
>> > 2016-02-05 03:20:28,825 DEBUG [cloud.agent.Agent] (UgentTask-7:null)
>> > Clearing watch list: 2
>> > 2016-02-05 03:20:28,825 DEBUG [utils.nio.NioConnection]
>> > (Agent-Selector:null) Closing socket
>> > Socket[addr=/*.*.3.3,port=8250,localport=50489]
>> > 2016-02-05 03:20:33,825 INFO  [cloud.agent.Agent] (UgentTask-7:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:20:38,826 INFO  [cloud.agent.Agent] (UgentTask-7:null)
>> > Reconnecting...
>> > 2016-02-05 03:20:38,829 INFO  [utils.nio.NioClient]
>> (Agent-Selector:null)
>> > Connecting to *.*.3.3:8250
>> > 2016-02-05 03:20:38,925 INFO  [utils.nio.NioClient]
>> (Agent-Selector:null)
>> > SSL: Handshake done
>> > 2016-02-05 03:20:38,926 INFO  [utils.nio.NioClient]
>> (Agent-Selector:null)
>> > Connected to *.*.3.3:8250
>> > 2016-02-05 03:20:43,926 INFO  [cloud.agent.Agent] (UgentTask-7:null)
>> > Connected to the server
>> > ===
>> >
>> > Sometimes, the Cloudstack agent will not be able to re-connect unless
>> if we
>> > stop and start the agent again manually:
>> >
>> > ===
>> > 2016-02-05 03:22:20,330 ERROR [cloud.agent.Agent] (UgentTask-6:null)
>> Ping
>> > Interval has gone past 300000.  Attempting to reconnect.
>> > 2016-02-05 03:22:20,331 DEBUG [cloud.agent.Agent] (UgentTask-6:null)
>> > Clearing watch list: 2
>> > 2016-02-05 03:22:20,353 DEBUG [utils.nio.NioConnection]
>> > (Agent-Selector:null) Closing socket
>> > Socket[addr=/*.*.3.3,port=8250,localport=46231]
>> > 2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:22:35,333 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:22:35,334 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:22:40,334 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:22:40,335 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:23:00,338 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:23:00,339 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:23:05,339 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:23:05,340 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > 2016-02-05 03:23:16,945 INFO  [cloud.agent.Agent]
>> > (AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
>> > 2016-02-05 03:23:16,946 DEBUG [cloud.agent.Agent]
>> > (AgentShutdownThread:null) Sending shutdown to management server
>> > 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Lost
>> > connection to the server. Dealing with the remaining commands...
>> > 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null)
>> Cannot
>> > connect because we still have 3 commands in progress.
>> > ===
>> >
>> > During the time when the problem happens, I can confirm that the CPU and
>> > memory load on the server is healthy and not overloaded. I understand
>> that
>> > such problem can happen when the hypervisor host is overloaded, however
>> > this is not the case -- especially since when such problem happens now,
>> it
>> > will strike up to 4-5 hypervisor hosts at the same time.
>> >
>> > On the management server, here are the typical logs:
>> >
>> > ===
>> > 2016-02-05 03:20:28,828 INFO  [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Investigating why host 72 has disconnected with
>> > event AgentDisconnected
>> > 2016-02-05 03:20:28,828 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) checking if agent (72) is alive
>> > 2016-02-05 03:20:28,833 DEBUG [agent.transport.Request]
>> > (AgentTaskPool-10:null) Seq 72-215557373: Sending  { Cmd , MgmtId:
>> > 161342671900, via: 72, Ver: v1, Flags: 100011
>> > , [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
>> > 2016-02-05 03:20:28,833 INFO  [agent.manager.AgentAttache]
>> > (AgentTaskPool-10:null) Seq 72-215557373: Unable to send due to Resource
>> > [Host:72] is unreachable: Host 72: C
>> > hannel is closed
>> > 2016-02-05 03:20:28,833 DEBUG [agent.manager.AgentAttache]
>> > (AgentTaskPool-10:null) Seq 72-215557373: Cancelling.
>> > 2016-02-05 03:20:28,833 WARN  [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Resource [Host:72] is unreachable: Host 72:
>> Channel
>> > is closed
>> > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>> > (AgentTaskPool-10:null) SimpleInvestigator unable to determine the
>> state of
>> > the host.  Moving on.
>> > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>> > (AgentTaskPool-10:null) XenServerInvestigator unable to determine the
>> state
>> > of the host.  Moving on
>> > .
>> > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.UserVmDomRInvestigator]
>> > (AgentTaskPool-10:null) checking if agent (72) is alive
>> > 2016-02-05 03:20:28,840 DEBUG [cloud.ha.UserVmDomRInvestigator]
>> > (AgentTaskPool-10:null) sending ping from (34) to agent's host ip
>> address
>> > (*.*.3.25)
>> > 2016-02-05 03:20:28,843 DEBUG [agent.transport.Request]
>> > (AgentTaskPool-10:null) Seq 34-815540048: Sending  { Cmd , MgmtId:
>> > 161342671900, via: 34, Ver: v1, Flags: 100011
>> > ,
>> >
>> [{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"*.*.3.25","wait":20}}]
>> > }
>> > 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
>> > (AgentManager-Handler-14:null) Seq 34-815540048: Processing:  { Ans: ,
>> > MgmtId: 161342671900, via: 34, Ver: v1, F
>> > lags: 10, [{"com.cloud.agent.api.Answer":{"result":true,"wait":0}}] }
>> > 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
>> > (AgentTaskPool-10:null) Seq 34-815540048: Received:  { Ans: , MgmtId:
>> > 161342671900, via: 34, Ver: v1, Flags: 10,
>> >  { Answer } }
>> > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.AbstractInvestigatorImpl]
>> > (AgentTaskPool-10:null) host (*.*.3.25) has been successfully pinged,
>> > returning that host is up
>> > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.UserVmDomRInvestigator]
>> > (AgentTaskPool-10:null) ping from (34) to agent's host ip address
>> > (*.*.3.25) successful, returning th
>> > at agent is disconnected
>> > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>> > (AgentTaskPool-10:null) null was able to determine host 72 is in
>> > Disconnected
>> > 2016-02-05 03:20:29,024 INFO  [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) The state determined is Disconnected
>> > 2016-02-05 03:20:29,024 WARN  [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Agent is disconnected but the host is still up:
>> > 72-**-kvm-05
>> > 2016-02-05 03:20:29,025 WARN  [apache.cloudstack.alerts]
>> > (AgentTaskPool-10:null)  alertType:: 7 // dataCenterId:: 6 // podId:: 6
>> //
>> > clusterId:: null // message:: Host disconnected, name: **-kvm-05
>> (id:72),
>> > availability zone: *****-******-01, pod: *****-******-Pod-01
>> > 2016-02-05 03:20:29,086 INFO  [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Host 72 is disconnecting with event
>> > AgentDisconnected
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) The next status of agent 72is Alert, current
>> status
>> > is Up
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Deregistering link for 72 with state Alert
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Remove Agent : 72
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.ConnectedAgentAttache]
>> > (AgentTaskPool-10:null) Processing Disconnect.
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentAttache]
>> > (AgentTaskPool-10:null) Seq 72-215547906: Sending disconnect to class
>> > com.cloud.network.security.SecurityGroupListener
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Sending Disconnect to listener:
>> >
>> com.cloud.hypervisor.xen.discoverer.XcpServerDiscoverer_EnhancerByCloudStack_8f0e0536
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Sending Disconnect to listener:
>> >
>> com.cloud.deploy.DeploymentPlanningManagerImpl_EnhancerByCloudStack_55d28fac
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Sending Disconnect to listener:
>> > com.cloud.network.NetworkManagerImpl_EnhancerByCloudStack_6763a13f
>> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
>> > (AgentTaskPool-10:null) Sending Disconnect to listener:
>> > com.cloud.storage.secondary.SecondaryStorageListener
>> > ===
>> >
>> > Can anyone advise how can I start troubleshooting to find the root
>> cause of
>> > the problem? So that I can take whatever measures to prevent similar
>> > problem from happening.
>> >
>> > Any advice is greatly appreciated.
>> >
>> > Looking forward to your reply, thank you.
>> >
>> > Cheers.
>> >
>> > -ip-
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message