cloudstack-users mailing list archives

From Jori Liesenborgs <>
Subject Problems after management server reboot & workaround
Date Wed, 08 May 2013 18:50:17 GMT

Hi everyone,

On our CloudStack setup (4.0.2), I noticed that after a reboot of the 
management server, I was no longer able to start new instances. A 
secondary problem was that the management-server.log file filled up 
extremely fast (gigabytes in a few hours) with messages like these:

2013-05-08 05:26:10,627 DEBUG [agent.manager.ClusteredAgentAttache] 
(AgentManager-Handler-4:null) Seq 7-1033568320: Forwarding Seq 
7-1033568320:  { Cmd , MgmtId: 38424150221294, via: 7, Ver: v1, Flags: 
100111, [{"StopCommand":{"isProxy":false,"vmName":"i-2-6-VM","wait":0}}] 
} to 130450099353672

This turned out to contain an important clue: when looking at the 
'mshost' table in the 'cloud' database, instead of seeing one entry for 
the management server ID, there now were two:

| id | msid            | runid         | name          | ...
|  1 | 130450099353672 | 1367919381740 | cloud-manager | ...
|  2 |  38424150221294 | 1367950608087 | cloud-manager | ...

These two IDs were the ones mentioned in the log file. In fact, after 
every reboot a new entry appeared in the 'mshost' table, and that new 
ID was being inserted into the 'host' entries for the system VMs 
'v-2-VM' and 's-1-VM'.
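
To check which management server IDs show up in such forwarding 
messages, something like the following Python sketch could be used. The 
regular expression is only derived from the single log line quoted 
above, and the names FORWARD_RE and forwarded_pairs are made up for this 
example, so treat it as a starting point rather than a complete parser.

```python
import re

# Matches "MgmtId: <sender>, ... } to <target>" as seen in the quoted
# ClusteredAgentAttache message. Derived from that one sample only.
FORWARD_RE = re.compile(r"MgmtId: (\d+),.*?\} to (\d+)", re.DOTALL)


def forwarded_pairs(log_text):
    """Return the set of (sender id, target id) pairs found in log_text."""
    return set((int(src), int(dst)) for src, dst in FORWARD_RE.findall(log_text))


if __name__ == "__main__":
    sample = (
        "2013-05-08 05:26:10,627 DEBUG [agent.manager.ClusteredAgentAttache] "
        "(AgentManager-Handler-4:null) Seq 7-1033568320: Forwarding Seq "
        "7-1033568320:  { Cmd , MgmtId: 38424150221294, via: 7, Ver: v1, "
        'Flags: 100111, [{"StopCommand":{"isProxy":false,'
        '"vmName":"i-2-6-VM","wait":0}}] } to 130450099353672'
    )
    for src, dst in sorted(forwarded_pairs(sample)):
        print("%d forwards to %d" % (src, dst))
```

If more than one distinct ID turns up for a single management server, 
comparing them with the 'msid' column of 'mshost' shows whether the 
same situation applies.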

Browsing through the code, it appears that in the file, the function 
getManagementServerId() returns a static value created by the 
MacAddress class. On a Linux platform (we are using Ubuntu), this 
address is taken from the first entry in the output of the command 
"/sbin/ifconfig -a". This turned out to be the address of the cloud0 
bridge interface, which changed after a reboot (or after deleting the 
bridge using brctl and restarting all of CloudStack).
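
To compare the IDs in 'mshost' with the interfaces on the machine, it 
can help to view them as MAC addresses. The sketch below assumes that 
the ID is simply the 48-bit MAC address read as one integer; I have not 
verified the exact conversion in the CloudStack source, and the helper 
names mac_to_int and int_to_mac are made up for this example.

```python
def mac_to_int(mac):
    """Read a colon-separated MAC address as a single 48-bit integer."""
    octets = mac.split(":")
    if len(octets) != 6:
        raise ValueError("expected six octets, got %r" % mac)
    value = 0
    for octet in octets:
        # Shift the accumulated value one byte left and append this octet.
        value = (value << 8) | int(octet, 16)
    return value


def int_to_mac(value):
    """Format a 48-bit integer as a lowercase colon-separated MAC address."""
    if not 0 <= value < (1 << 48):
        raise ValueError("value does not fit in 48 bits")
    # Walk from the most significant byte (shift 40) down to the least (shift 0).
    return ":".join("%02x" % ((value >> shift) & 0xFF) for shift in range(40, -1, -8))
```

Running int_to_mac on the 'msid' values from the table and comparing 
the result with the HWaddr fields in the "ifconfig -a" output should 
show which interface each ID came from, if the assumption holds.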

To avoid having to modify and recompile CloudStack, I created a fake 
ifconfig: a simple Python script that most of the time just runs the 
real ifconfig (which I renamed to ifconfig-bin), but when called as 
"/sbin/ifconfig -a", it rearranges the output so that eth0 is shown 
first (and not cloud0). This way, the management server ID is basically 
the MAC address of eth0, which stays the same after a reboot.
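
A minimal sketch of such a wrapper is shown below. It is not the exact 
script I used: the function names and the block-splitting logic are 
illustrative, and it assumes that the real binary lives at 
/sbin/ifconfig-bin and that interface blocks in the output are separated 
by blank lines.

```python
#!/usr/bin/env python3
import os
import re
import subprocess
import sys

REAL_IFCONFIG = "/sbin/ifconfig-bin"  # assumed location of the renamed binary
PREFERRED = "eth0"                    # interface that should be listed first


def interface_name(block):
    """Return the interface name: the first token of the block's first line."""
    return re.split(r"[\s:]", block, maxsplit=1)[0]


def reorder(output, preferred=PREFERRED):
    """Move the block for `preferred` to the front; keep the other blocks in order."""
    blocks = [b for b in re.split(r"\n\s*\n", output.strip("\n")) if b.strip()]
    if not blocks:
        return output
    # sorted() is stable, so blocks that are not preferred keep their order.
    blocks = sorted(blocks, key=lambda b: interface_name(b) != preferred)
    return "\n\n".join(blocks) + "\n\n"


def main(argv):
    if argv[1:] == ["-a"]:
        # The one invocation that gets special treatment.
        result = subprocess.run([REAL_IFCONFIG, "-a"], stdout=subprocess.PIPE,
                                universal_newlines=True, check=True)
        sys.stdout.write(reorder(result.stdout))
        return 0
    # Every other invocation is handed to the real ifconfig unchanged.
    os.execv(REAL_IFCONFIG, [REAL_IFCONFIG] + argv[1:])


# Only act when the renamed binary is actually present on this machine.
if __name__ == "__main__" and os.path.exists(REAL_IFCONFIG):
    sys.exit(main(sys.argv))
```

Installed as /sbin/ifconfig and made executable, this behaves like the 
real tool for every call except "ifconfig -a". Keep in mind that a 
package upgrade may put the original binary back in place.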

I haven't had the time to run a long-running test yet (I only figured 
it out this afternoon), but after several reboots, the management server 
ID now stays the same, and I am still able to start new instances.

Hope someone finds this useful.

