cloudstack-dev mailing list archives

From Wido den Hollander <w...@widodh.nl>
Subject [KVM] Agent hanging and disconnecting when libvirt doesn't respond
Date Thu, 11 Jul 2013 19:31:39 GMT
Hi,

Over the last two days I noticed an incident on a cluster where HA kicked 
in because a host was marked as down after the Agent disconnected.

The problem was that libvirt didn't respond to the call the Agent was making.

The underlying problem was that the Qemu/KVM process was having some 
issues and never responded to libvirt over the monitor socket, and 
libvirt in turn never responded to the Agent.

In the logs I saw:

Ping Interval has gone past 300000.  Attempting to reconnect.

DEBUG [utils.nio.NioConnection] (Agent-Selector:null) Closing socket 
Socket[addr=/XX.XX.XX.X,port=8250,localport=49098]

[cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. 
Dealing with the remaining commands...

[cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still 
have 1 commands in progress.

[cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. 
Dealing with the remaining commands...

[cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still 
have 1 commands in progress.

[cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. 
Dealing with the remaining commands...

[cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still 
have 1 commands in progress.

This kept repeating until I restarted the Agent, because that command 
would never complete while libvirt was blocking.

For scripts we have a timeout, so when qemu-img doesn't complete in time 
we give up, but for other commands like this one we don't have such a 
timeout.
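
To make the idea concrete, here is a minimal sketch of a timeout around a 
blocking call. It is not the Agent's actual code: the class BoundedCall 
and the method callWithTimeout are hypothetical names, and the call is 
passed in as a plain Callable rather than a real libvirt request.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of CloudStack.
public class BoundedCall {
    // Daemon threads, so a call that never returns cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-call");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs the call on a worker thread and waits at most timeoutMs for it.
     * Returns an empty Optional on timeout (or if the call returned null).
     */
    public static <T> Optional<T> callWithTimeout(Callable<T> call, long timeoutMs) {
        Future<T> future = POOL.submit(call);
        try {
            return Optional.ofNullable(future.get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Requests an interrupt; a thread stuck in native code may ignore it.
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("call failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Optional.empty();
        }
    }
}
```

Note that this only unblocks the caller. If the worker thread is stuck 
inside a native call, it stays stuck, so the number of abandoned threads 
would still need to be limited somehow.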

As a test, I changed the Agent to break out of the loop where we wait 
for any remaining commands, so that it reconnects anyway. But I don't 
know if that is a good decision.
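
Roughly, the idea is to put a deadline on that wait instead of looping 
forever. The sketch below is an illustration under assumptions, not the 
real Agent loop: ReconnectGate, awaitDrain and the AtomicInteger counter 
are hypothetical stand-ins for however the Agent tracks commands in 
progress.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of CloudStack.
public class ReconnectGate {
    /**
     * Waits until no commands are in progress or maxWaitMs has elapsed.
     * Returns true if the commands drained, false if we gave up waiting.
     */
    public static boolean awaitDrain(AtomicInteger inProgress, long maxWaitMs, long pollMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMs);
        while (inProgress.get() > 0) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // deadline passed; caller may reconnect anyway
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

When awaitDrain returns false the caller would reconnect even though a 
command is still outstanding, which is exactly the trade-off in question: 
the late result of that command has to be dropped or handled separately.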

We currently assume that libvirt always responds, but that is not the 
case. There can be any number of reasons why libvirt can't respond.

Any suggestions on how to handle this case?

Wido

