cloudstack-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Koushik Das (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (CLOUDSTACK-2918) In a scaled up environment, hosts fail to come up after Management server restart in clustered set up
Date Wed, 10 Jul 2013 08:09:48 GMT

     [ https://issues.apache.org/jira/browse/CLOUDSTACK-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Koushik Das resolved CLOUDSTACK-2918.
-------------------------------------

    Resolution: Fixed
    
> In a scaled up environment, hosts fail to come up after Management server restart in
clustered set up
> -----------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-2918
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-2918
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Management Server
>    Affects Versions: 4.2.0
>         Environment: Load Test environment with simulator
>            Reporter: Sowmya Krishnan
>            Assignee: Koushik Das
>             Fix For: 4.2.0
>
>
> Scaled setup with 20K simulated hosts and as many VMs, Basic Zone, Clustered setup with
3 management servers.
> After restarting one or all the Management servers, re balancing hosts takes very long
time, and the rebalanced hosts don't come to Up state either.
> Following is the error message from logs when the MS was starting up:
> 2013-05-31 00:00:22,361 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null)
Joining node, IP: 10.223.48.2, msid: 206915885094132
> 2013-05-31 00:00:22,361 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null)
Joining node, IP: 10.223.48.66, msid: 206915885097283
> 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null)
Receive cluster alert, EventArgs: com.cloud.cluster.ClusterNodeJoinEventArgs
> 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null)
Handle cluster node join alert, joined node: 10.223.48.2, msidL: 206915885094132
> 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null)
Handle cluster node join alert, joined node: 10.223.48.66, msidL: 206915885097283
> 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null)
Management server node 10.223.48.66 is up, send alert
> 2013-05-31 00:00:22,539 WARN [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null)
Notifying management server join event took 178 ms
> 2013-05-31 00:00:25,825 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null)
Management server heartbeat takes too long to finish. profiler: Done. Duration: 5583ms, profilerH
> eartbeatUpdate: Done. Duration: 187ms, profilerPeerScan: Done. Duration: 560ms, profilerAgentLB:
Done. Duration: 4836ms
> 2013-05-31 00:00:25,963 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null)
Detected management node joined, id:3, nodeIP:10.223.48.130
> 2013-05-31 00:00:25,964 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null)
Notify management server node join to listeners.
> 2013-05-31 00:00:25,964 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null)
Joining node, IP: 10.223.48.130, msid: 206915885093830
> 2013-05-31 00:00:25,964 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null)
Receive cluster alert, EventArgs: com.cloud.cluster.ClusterNodeJoinEventArgs
> 2013-05-31 00:00:25,964 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null)
Handle cluster node join alert, joined node: 10.223.48.130, msidL: 206915885093830
> 2013-05-31 00:00:25,964 WARN [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null)
Notifying management server join event took 0 ms
> 2013-05-31 00:00:27,501 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null)
Management server heartbeat takes too long to finish. profiler: Done. Duration: 1675ms, profilerH
> eartbeatUpdate: Done. Duration: 135ms, profilerPeerScan: Done. Duration: 179ms, profilerAgentLB:
Done. Duration: 1361ms
> 2013-05-31 00:00:29,389 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null)
Management server heartbeat takes too long to finish. profiler: Done. Duration: 1888ms, profilerH
> eartbeatUpdate: Done. Duration: 462ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB:
Done. Duration: 1424ms
> 2013-05-31 00:00:33,912 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null)
Management server heartbeat takes too long to finish. profiler: Done. Duration: 1643ms, profilerH
> eartbeatUpdate: Done. Duration: 141ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB:
Done. Duration: 1501ms
> 2013-05-31 00:00:41,328 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null)
Management server heartbeat takes too long to finish. profiler: Done. Duration: 1675ms, profilerH
> eartbeatUpdate: Done. Duration: 178ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB:
Done. Duration: 1495ms
> 2013-05-31 00:00:48,799 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Running
Capacity Checker ...
> 2013-05-31 00:00:48,805 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) recalculating
system capacity
> 2013-05-31 00:00:48,805 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing
cpu/ram capacity update
> 2013-05-31 00:00:48,883 DEBUG [network.router.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:null)
Found 0 routers to update status.
> 2013-05-31 00:00:48,901 DEBUG [network.router.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:null)
Found 0 networks to update RvR status.
> 2013-05-31 00:00:50,031 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null)
Management server heartbeat takes too long to finish. profiler: Done. Duration: 1596ms, profilerHeartbeatUpdate:
Done. Duration: 179ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: Done. Duration:
1416ms
> 2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null)
No need to calibrate cpu capacity, host:2 usedCpu: 500 reservedCpu: 0
> 2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null)
No need to calibrate memory capacity, host:2 usedMem: 268435456 reservedMem: 0
> 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null)
No need to calibrate cpu capacity, host:3 usedCpu: 500 reservedCpu: 0
> 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null)
No need to calibrate memory capacity, host:3 usedMem: 1073741824 reservedMem: 0
> 2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null)
No need to calibrate cpu capacity, host:5 usedCpu: 0 reservedCpu: 0
> ...
> ...
> ...
> 2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null)
No need to calibrate memory capacity, host:2 usedMem: 268435456 reservedMem: 0
> 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null)
No need to calibrate cpu capacity, host:3 usedCpu: 500 reservedCpu: 0
> 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null)
No need to calibrate memory capacity, host:3 usedMem: 1073741824 reservedMem: 0
> 2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null)
No need to calibrate cpu capacity, host:5 usedCpu: 0 reservedCpu: 0
> 2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null)
No need to calibrate memory capacity, host:5 usedMem: 0 reservedMem: 0
> 2013-05-31 00:00:50,414 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null)
No need to calibrate cpu capacity, host:6 usedCpu: 0 reservedCpu: 0
> ...
> ...
> ...
> 2013-05-31 00:03:05,249 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null)
No need to calibrate memory capacity, host:20002 usedMem: 0 reservedMem: 0
> 2013-05-31 00:03:05,249 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Done
executing cpu/ram capacity update
> 2013-05-31 00:03:05,250 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing
storage capacity update
> 2013-05-31 00:03:05,890 DEBUG [cloud.storage.StorageManagerImpl] (CapacityChecker:null)
Successfully set Capacity - 1099511627776 for capacity type - 9 , DataCenterId - 1, HostOrPoolId
- 20
> 0, PodId 1
> 2013-05-31 00:03:06,080 DEBUG [cloud.storage.StorageManagerImpl] (CapacityChecker:null)
Successfully set Capacity - 1099511627776 for capacity type - 9 , DataCenterId - 1, HostOrPoolId
- 20
> 1, PodId 1
> 2013-05-31 00:04:32,680 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Done
executing storage capacity update
> 2013-05-31 00:04:32,680 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing
capacity updates for public ip and Vlans
> 2013-05-31 00:04:33,488 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null)
Management server heartbeat takes too long to finish. profiler: Done. Duration: 1565ms, profilerHeartbeatUpdate:
Done. Duration: 122ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration:
1441ms
> 2013-05-31 00:04:35,070 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null)
Management server heartbeat takes too long to finish. profiler: Done. Duration: 1582ms, profilerHeartbeatUpdate:
Done. Duration: 172ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration:
1408ms
> 2013-05-31 00:04:35,206 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Done
capacity updates for public ip and Vlans
> 2013-05-31 00:04:35,206 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing
capacity updates for private ip
> 2013-05-31 00:04:37,155 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null)
Management server heartbeat takes too long to finish. profiler: Done. Duration: 2085ms, profilerHeartbeatUpdate:
Done. Duration: 158ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration:
1925ms
> ...
> ...
> ...
> 2013-05-31 01:56:10,384 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null)
Management server heartbeat takes too long to finish. profiler: Done. Duration: 2105ms, profilerHeartbeatUpdate:
Done. Duration: 141ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: Done. Duration:
1963ms
> 2013-05-31 01:56:12,043 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null)
Management server heartbeat takes too long to finish. profiler: Done. Duration: 1659ms, profilerHeartbeatUpdate:
Done. Duration: 120ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration:
1537ms
> 2013-05-31 01:56:14,172 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null)
Management server heartbeat takes too long to finish. profiler: Done. Duration: 2129ms, profilerHeartbeatUpdate:
Done. Duration: 96ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration:
2031ms
> All 3 Management servers have the same above message and the hosts continue to be in
disconnected state
> mysql> select count(*), status, type, mgmt_server_id from host group by mgmt_server_id,
status, type;
> +----------+--------------+--------------------+-----------------+
> | count(*) | status | type | mgmt_server_id |
> +----------+--------------+--------------------+-----------------+
> | 1 | Alert | SecondaryStorage | NULL |
> | 17854 | Disconnected | Routing | NULL |
> | 1 | Disconnected | SecondaryStorageVM | NULL |
> | 53 | Disconnected | Routing | 206915885093830 |
> | 1485 | Disconnected | Routing | 206915885094132 |
> | 608 | Disconnected | Routing | 206915885097283 |
> +----------+--------------+--------------------+-----------------+
> 6 rows in set (0.07 sec)
> Observed this issue even with lesser hosts ~5K 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message