cloudstack-issues mailing list archives

From "Koushik Das (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLOUDSTACK-4371) [Performance Testing] Basic zone with 20K Hosts, management server restart leaves the hosts in disconnected state for very long time
Date Mon, 19 Aug 2013 11:25:48 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-4371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743736#comment-13743736 ]

Koushik Das commented on CLOUDSTACK-4371:
-----------------------------------------

I verified with XS (XenServer) and found that the local storage pool is not created on every
host reconnect. The pool is added only when the host connects for the first time (provided
local storage is enabled at the zone level). During a host reconnect there is a check for
whether the local pool already exists, and if it does, creation is skipped.

So, based on the exception, this looks like a simulator setup issue.
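
The reconnect behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not CloudStack code: `PoolRegistry`, `on_host_connect`, and `DuplicatePoolError` are hypothetical names, and the in-memory dictionary stands in for the real storage pool records. It shows why a check-before-create path should not hit the "Another active pool with the same uuid already exists" error on reconnect, whereas an unconditional create would.

```python
class DuplicatePoolError(Exception):
    """Hypothetical stand-in for the 'same uuid already exists' error."""


class PoolRegistry:
    """Hypothetical in-memory stand-in for the storage pool records."""

    def __init__(self):
        self._pools = {}  # pool uuid -> host id

    def exists(self, uuid):
        return uuid in self._pools

    def create_pool(self, uuid, host_id):
        # Unconditional creation fails if the uuid is already registered.
        if uuid in self._pools:
            raise DuplicatePoolError(uuid)
        self._pools[uuid] = host_id


def on_host_connect(registry, host_id, pool_uuid, local_storage_enabled):
    """Return True if a local pool was created, False if creation was skipped."""
    if not local_storage_enabled:
        # Local storage is disabled at the zone level: nothing to create.
        return False
    if registry.exists(pool_uuid):
        # Reconnect case: the pool is already there, so skip creation.
        return False
    # First connect: add the local pool.
    registry.create_pool(pool_uuid, host_id)
    return True
```

Calling `on_host_connect` twice with the same host and pool uuid creates the pool once and skips it the second time; only a direct second call to `create_pool` raises the duplicate error.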


                
> [Performance Testing] Basic zone with 20K Hosts, management server restart leaves the
hosts in disconnected state for very long time
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-4371
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-4371
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Management Server
>    Affects Versions: 4.2.0
>         Environment: Basic zone, with over 20K simulator hosts
>            Reporter: Sowmya Krishnan
>              Labels: performance
>             Fix For: 4.2.0
>
>         Attachments: ms1_restartfail.log.gz, ms2_restartfail.log.gz, ms3_restartfail.log.gz
>
>
> Basic zone performance test bed:
> 20K simulator hosts,
> 3 Management servers
> 1 host/cluster
> Local storage
> Java heap size: 12GB
> db.cloud.maxActive=2000
> direct.agent.load.size=1000
> agent.lb.enabled=true
> Deploy around 20K simulator hosts with 3 Management servers clustered
> (Not deployed any VMs yet)
> After all hosts are deployed, stop all 3 Management servers and then start all 3 one after another
> Result
> =====
> Hosts do not reach the Connected state at all, even after 10 minutes. Around 2K of them go into the Alert state while the rest remain in the Disconnected state.
> mysql> select count(*), status, resource_state, type, mgmt_server_id from host group by mgmt_server_id, status, type, resource_state;
> +----------+--------------+----------------+--------------------+----------------+
> | count(*) | status       | resource_state | type               | mgmt_server_id |
> +----------+--------------+----------------+--------------------+----------------+
> |     1946 | Alert        | Enabled        | Routing            |           NULL |
> |    18054 | Disconnected | Enabled        | Routing            |           NULL |
> |        1 | Disconnected | Enabled        | SecondaryStorageVM |           NULL |
> +----------+--------------+----------------+--------------------+----------------+
> 3 rows in set (0.11 sec)
> MS logs show a lot of storage pool exceptions while hosts try to connect:
> 2013-08-16 05:49:25,592 DEBUG [agent.transport.Request] (AgentTaskPool-12:null) Seq 13-32440322:
Sending  { Cmd , MgmtId: 206915885094132, via: 13, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CleanupNetworkRulesCmd":{"interval":2028,"wait":0}}] }
> 2013-08-16 05:49:25,592 DEBUG [agent.transport.Request] (AgentTaskPool-12:null) Seq 13-32440322:
Executing:  { Cmd , MgmtId: 206915885094132, via: 13, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CleanupNetworkRulesCmd":{"interval":2028,"wait":0}}] }
> 2013-08-16 05:49:25,592 DEBUG [xen.discoverer.XcpServerDiscoverer] (AgentTaskPool-14:null)
Not XenServer so moving on.
> 2013-08-16 05:49:25,592 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-14:null)
Sending Connect to listener: DeploymentPlanningManagerImpl_EnhancerByCloudStack_76f3d8e4
> 2013-08-16 05:49:25,591 DEBUG [cloud.resource.AgentResourceBase] (ClusteredAgentManager
Timer:null) Deserializing simulated agent on reconnect
> 2013-08-16 05:49:25,594 INFO  [network.security.SecurityGroupListener] (AgentTaskPool-12:null)
Scheduled network rules cleanup, interval=2028
> 2013-08-16 05:49:25,594 INFO  [network.security.SecurityGroupListener] (AgentTaskPool-12:null)
Received a host startup notification
> 2013-08-16 05:49:25,595 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-12:null)
Sending Connect to listener: StoragePoolMonitor
> ...
> ...
> 2013-08-16 05:49:25,761 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-12:null)
Sending Connect to listener: ClusteredVirtualMachineManagerImpl_EnhancerByCloudStack_b5459b7b
> 2013-08-16 05:49:25,764 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-12:null)
Found 0 VMs for host 13
> 2013-08-16 05:49:25,765 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-12:null)
Sending Connect to listener: LocalStoragePoolListener
> 2013-08-16 05:49:25,768 DEBUG [datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl]
(AgentTaskPool-12:null) createPool Params @ scheme - Filesystem storageHost - 172.1.3.131
hostPath - /mnt/2a2463b4-4fd2-4ac7-ad3f-040a3046e478 port - -1
> 2013-08-16 05:49:25,771 DEBUG [datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl]
(AgentTaskPool-12:null) Another active pool with the same uuid already exists
> 2013-08-16 05:49:25,772 WARN  [cloud.storage.StorageManagerImpl] (AgentTaskPool-12:null)
Unable to setup the local storage pool for Host[-13-Routing]
> com.cloud.utils.exception.CloudRuntimeException: Another active pool with the same uuid
already exists
>         at org.apache.cloudstack.storage.datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl.initialize(CloudStackPrimaryDataStoreLifeCycleImpl.java:319)
>         at com.cloud.storage.StorageManagerImpl.createLocalStorage(StorageManagerImpl.java:647)
>         at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
>         at com.cloud.storage.LocalStoragePoolListener.processConnect(LocalStoragePoolListener.java:86)
>         at com.cloud.agent.manager.AgentManagerImpl.notifyMonitorsOfConnection(AgentManagerImpl.java:587)
>         at com.cloud.agent.manager.AgentManagerImpl.handleDirectConnectAgent(AgentManagerImpl.java:1479)
>         at com.cloud.resource.ResourceManagerImpl.createHostAndAgent(ResourceManagerImpl.java:1739)
>         at com.cloud.resource.ResourceManagerImpl.createHostAndAgent(ResourceManagerImpl.java:1901)
>         at com.cloud.agent.manager.AgentManagerImpl$SimulateStartTask.run(AgentManagerImpl.java:1130)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:679)
> 2013-08-16 05:49:25,773 INFO  [utils.exception.CSExceptionErrorCode] (AgentTaskPool-12:null)
Could not find exception: com.cloud.exception.ConnectionException in error code list for exceptions
> 2013-08-16 05:49:25,773 WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-12:null)
Monitor LocalStoragePoolListener says there is an error in the connect process for 13 due
to Unable to setup the local storage pool for Host[-13-Routing]
> 2013-08-16 05:49:25,773 INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-12:null)
Host 13 is disconnecting with event AgentDisconnected

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
