cloudstack-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Venkata Siva Vijayendra Bhamidipati (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLOUDSTACK-3568) [Automation]Parallel deployment - Vmware - When deploying 30 parallel Vms , we see ~15 Vms fail in org.apache.cloudstack.storage.command.CopyCommand.
Date Thu, 25 Jul 2013 02:41:50 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719169#comment-13719169
] 

Venkata Siva Vijayendra Bhamidipati commented on CLOUDSTACK-3568:
-----------------------------------------------------------------

Went through the logs and looked at Sangeetha's setup (primary storage).

Let's take an example failure in the logs and trace the events:


2013-07-16 15:47:48,631 INFO  [storage.resource.VmwareStorageProcessor] (DirectAgent-68:10.223.57.66)
creating linked clone from template
2013-07-16 15:47:49,316 DEBUG [storage.motion.AncientDataMotionStrategy] (Job-Executor-9:job-102
= [ 1b0c2fa4-2326-4fd2-921b-98d19ca17d38 ]) copyAsync inspecting src type TEMPLATE copyAsync
inspecting dest type VOLUME
2013-07-16 15:47:49,323 DEBUG [agent.transport.Request] (Job-Executor-9:job-102 = [ 1b0c2fa4-2326-4fd2-921b-98d19ca17d38
]) Seq 1-227737728: Sending  { Cmd , MgmtId: 7647994577963, via: 1, Ver: v1, Flags: 100011,
[{"org.apache.cloudstack.storage.command.CopyCommand":{"srcTO":{"org.apache.cloudstack.storage.to.TemplateObjectTO":{"path":"6343361f88b73f178855d62eb1c3a698","origUrl":"http://nfs1.lab.vmops.com/templates/passwordssh/CentOS5.3-x86_64.ova","uuid":"1c3517c1-0226-47a7-90b3-96b7de70fc64","id":202,"format":"OVA","accountId":2,"checksum":"f6f881b7f2292948d8494db837fe0f47","hvm":true,"displayText":"test","imageDataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"7caed722-555d-3f71-8794-76c5e8d8d5ad","id":1,"poolType":"NetworkFilesystem","host":"10.223.110.232","path":"/export/home/sangeetha/master-vmware/primary","port":2049}},"name":"202-2-5806a2dd-c8ea-3a15-9b1a-af7098ec5de5"}},"destTO":{"org.apache.cloudstack.storage.to.VolumeObjectTO":{"uuid":"7a3a4484-346a-4d83-903a-59e5dd7d2dff","volumeType":"ROOT","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"7caed722-555d-3f71-8794-76c5e8d8d5ad","id":1,"poolType":"NetworkFilesystem","host":"10.223.110.232","path":"/export/home/sangeetha/master-vmware/primary","port":2049}},"name":"ROOT-106","size":0,"volumeId":104,"vmName":"i-4-106-VM","accountId":4,"format":"OVA","id":104}},"executeInSequence":false,"wait":0}}]
}
2013-07-16 15:47:49,324 DEBUG [agent.transport.Request] (Job-Executor-9:job-102 = [ 1b0c2fa4-2326-4fd2-921b-98d19ca17d38
]) Seq 1-227737728: Executing:  { Cmd , MgmtId: 7647994577963, via: 1, Ver: v1, Flags: 100011,
[{"org.apache.cloudstack.storage.command.CopyCommand":{"srcTO":{"org.apache.cloudstack.storage.to.TemplateObjectTO":{"path":"6343361f88b73f178855d62eb1c3a698","origUrl":"http://nfs1.lab.vmops.com/templates/passwordssh/CentOS5.3-x86_64.ova","uuid":"1c3517c1-0226-47a7-90b3-96b7de70fc64","id":202,"format":"OVA","accountId":2,"checksum":"f6f881b7f2292948d8494db837fe0f47","hvm":true,"displayText":"test","imageDataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"7caed722-555d-3f71-8794-76c5e8d8d5ad","id":1,"poolType":"NetworkFilesystem","host":"10.223.110.232","path":"/export/home/sangeetha/master-vmware/primary","port":2049}},"name":"202-2-5806a2dd-c8ea-3a15-9b1a-af7098ec5de5"}},"destTO":{"org.apache.cloudstack.storage.to.VolumeObjectTO":{"uuid":"7a3a4484-346a-4d83-903a-59e5dd7d2dff","volumeType":"ROOT","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"7caed722-555d-3f71-8794-76c5e8d8d5ad","id":1,"poolType":"NetworkFilesystem","host":"10.223.110.232","path":"/export/home/sangeetha/master-vmware/primary","port":2049}},"name":"ROOT-106","size":0,"volumeId":104,"vmName":"i-4-106-VM","accountId":4,"format":"OVA","id":104}},"executeInSequence":false,"wait":0}}]
}
.
.
.
2013-07-16 15:47:50,229 INFO  [vmware.mo.DatastoreMO] (DirectAgent-70:10.223.57.66) Folder
ROOT-106-104 does not exist on datastore
2013-07-16 15:47:51,025 INFO  [storage.resource.VmwareStorageProcessor] (DirectAgent-70:10.223.57.66)
creating linked clone from template
.
.

2013-07-16 15:47:53,917 INFO  [storage.resource.VmwareStorageProcessor] (DirectAgent-70:10.223.57.66)
Move volume out of volume-wrapper VM 
2013-07-16 15:47:53,927 INFO  [storage.resource.VmwareStorageProcessor] (DirectAgent-68:10.223.57.66)
Move volume out of volume-wrapper VM 
2013-07-16 15:47:55,099 ERROR [storage.resource.VmwareStorageProcessor] (DirectAgent-71:10.223.57.66)
CreateCommand failed due to Exception: java.lang.RuntimeException
Message: File [7caed722555d3f71879476c5e8d8d5ad] ROOT-106-104/ROOT-106-104.vmdk was not found

java.lang.RuntimeException: File [7caed722555d3f71879476c5e8d8d5ad] ROOT-106-104/ROOT-106-104.vmdk
was not found
	at com.cloud.hypervisor.vmware.util.VmwareClient.waitForTask(VmwareClient.java:290)
	at com.cloud.hypervisor.vmware.mo.HostDatastoreBrowserMO.searchDatastore(HostDatastoreBrowserMO.java:57)
	at com.cloud.hypervisor.vmware.mo.HostDatastoreBrowserMO.searchDatastore(HostDatastoreBrowserMO.java:78)
	at com.cloud.hypervisor.vmware.mo.DatastoreMO.folderExists(DatastoreMO.java:319)
	at com.cloud.storage.resource.VmwareStorageProcessor.createVMLinkedClone(VmwareStorageProcessor.java:279)
	at com.cloud.storage.resource.VmwareStorageProcessor.cloneVolumeFromBaseTemplate(VmwareStorageProcessor.java:393)
	at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:73)
	at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.handleStorageCommands(StorageSubsystemCommandHandlerBase.java:49)
	at com.cloud.hypervisor.vmware.resource.VmwareResource.executeRequest(VmwareResource.java:565)
	at com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:679)
2013-07-16 15:47:55,100 DEBUG [agent.manager.DirectAgentAttache] (DirectAgent-71:null) Seq
1-227737729: Response Received: 
2013-07-16 15:47:55,100 DEBUG [agent.transport.Request] (DirectAgent-71:null) Seq 1-227737729:
Processing:  { Ans: , MgmtId: 7647994577963, via: 1, Ver: v1, Flags: 10, [{"org.apache.cloudstack.storage.command.CopyCmdAnswer":{"result":false,"details":"java.lang.RuntimeException:
File [7caed722555d3f71879476c5e8d8d5ad] ROOT-106-104/ROOT-106-104.vmdk was not found","wait":0}}]
}
2013-07-16 15:47:55,100 DEBUG [agent.transport.Request] (Job-Executor-8:job-101 = [ cad3c4ea-3d40-46f4-bce5-26d60d4af7c1
]) Seq 1-227737729: Received:  { Ans: , MgmtId: 7647994577963, via: 1, Ver: v1, Flags: 10,
{ CopyCmdAnswer } }
2013-07-16 15:47:55,104 ERROR [storage.resource.VmwareStorageProcessor] (DirectAgent-70:10.223.57.66)
CreateCommand failed due to Exception: java.lang.RuntimeException
Message: File [7caed722555d3f71879476c5e8d8d5ad] ROOT-106-104/ROOT-106-104.vmdk was not found


immediately followed by:



java.lang.RuntimeException: File [7caed722555d3f71879476c5e8d8d5ad] ROOT-106-104/ROOT-106-104.vmdk
was not found
	at com.cloud.hypervisor.vmware.util.VmwareClient.waitForTask(VmwareClient.java:290)
	at com.cloud.hypervisor.vmware.mo.DatastoreMO.moveDatastoreFile(DatastoreMO.java:218)
	at com.cloud.storage.resource.VmwareStorageProcessor.createVMLinkedClone(VmwareStorageProcessor.java:292)
	at com.cloud.storage.resource.VmwareStorageProcessor.cloneVolumeFromBaseTemplate(VmwareStorageProcessor.java:393)
	at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:73)
	at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.handleStorageCommands(StorageSubsystemCommandHandlerBase.java:49)
	at com.cloud.hypervisor.vmware.resource.VmwareResource.executeRequest(VmwareResource.java:565)
	at com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:679)
2013-07-16 15:47:55,627 DEBUG [agent.manager.DirectAgentAttache] (DirectAgent-70:null) Seq
1-227737728: Response Received: 
2013-07-16 15:47:55,627 DEBUG [agent.transport.Request] (DirectAgent-70:null) Seq 1-227737728:
Processing:  { Ans: , MgmtId: 7647994577963, via: 1, Ver: v1, Flags: 10, [{"org.apache.cloudstack.storage.command.CopyCmdAnswer":{"result":false,"details":"java.lang.RuntimeException:
File [7caed722555d3f71879476c5e8d8d5ad] ROOT-106-104/ROOT-106-104.vmdk was not found","wait":0}}]
}
2013-07-16 15:47:55,627 DEBUG [agent.transport.Request] (Job-Executor-9:job-102 = [ 1b0c2fa4-2326-4fd2-921b-98d19ca17d38
]) Seq 1-227737728: Received:  { Ans: , MgmtId: 7647994577963, via: 1, Ver: v1, Flags: 10,
{ CopyCmdAnswer } }
2013-07-16 15:47:55,626 ERROR [storage.resource.VmwareStorageProcessor] (DirectAgent-68:10.223.57.66)
CreateCommand failed due to Exception: java.lang.RuntimeException




Looking at the code path that is invoked in the above operations, in VmwareResource.java and
VmwareStorageProcessor.java, the mgmt server code in the entire chain is ensuring that it
waits for the result of the asynchronous operations it fires, before attempting to proceed
to the next step.

The mgmt server does these steps broadly, in that sequence - first, it looks to see if a folder
by name ROOT-x-y already exists on primary storage. If it does, it attempts to delete it.
If not, good, it proceeds. When doing this search for the folder, since the search can take
a while, the mgmt server needs to wait for the async search operation to complete, so it waits
for the task it fired and proceeds only when it hears back from vCenter with a task completion
and result message. Next, it invokes either a linked or full clone operation to create the
VM's root disks, using a base template snapshot. Here also, it is correctly waiting for the
clone operation to complete and succeed or fail. After vCenter tells it that it has succeeded,
it then attempts to move the ROOT-x-y/ROOT-x-y.vmdk and delta/full.vmdk files outside that
directory to the root folder (primary storage root).

The problem seems to be, that the mgmt server is assuming, that as a result of the clone operation
returning after the mgmt server having duly waited for it to complete, the ROOT-x-y/ directory
and the vmdk files are present on the primary storage by then. Looks like this assumption
is wrong. I don't know why - I would also think that a clone operation returning success meant
that the files are there on the secondary storage by then.

I looked into the primary storage in Sangeetha's setup and the directory is indeed present:

[root@nfs2 primary]# ls -R ROOT-106-104/
ROOT-106-104/:
ROOT-106-104-delta.vmdk  ROOT-106-104.vmdk  ROOT-106-104.vmsd  ROOT-106-104.vmx  ROOT-106-104.vmxf
[root@nfs2 primary]#


This can only mean that vCenter is wrongly giving the mgmt server the successful completion
of the VM clone task before the VM root files actually eventually show up on storage.

One fix that I can think of is this - in createVMLinkedClone(), after we invoke createLinkedClone(),
and after it returns successfully, before we attempt to move the root files here:

        // we can't rely on un-offical API (VirtualMachineMO.moveAllVmDiskFiles() any more,
use hard-coded disk names that we know
        // to move files
        s_logger.info("Move volume out of volume-wrapper VM ");
        dsMo.moveDatastoreFile(String.format("[%s] %s/%s.vmdk", dsMo.getName(), vmdkName,
vmdkName),
                dcMo.getMor(), dsMo.getMor(),
                String.format("[%s] %s.vmdk", dsMo.getName(), vmdkName), dcMo.getMor(), true);


Put in a search for those files, and retry with a brief sleep of half a second between searches,
upto a maximum of 10 retries.

Note that the search for the file itself is already coded to wait for the asynchronous task
result.

This should give enough time to vCenter/ESX to place the files on the primary storage. Ideally,
vCenter should return a result to its API only after the files are placed on storage. Also,
this probably isn't vCenter's fault either - perhaps, it is receiving a success message from
NFS before NFS can actually manifest the files? (not too sure about this though. It sounds
far fetched.).

Hardcoding the retries and sleep interval at 10 and 0.5 seconds respectively may not suffice
in extremely large deployments - what does the community advise?

Also I don't quite have the hardware to test this massive deployment out, so will need help
to test out the fix.




                
> [Automation]Parallel deployment - Vmware - When deploying 30 parallel Vms , we see ~15
Vms fail in org.apache.cloudstack.storage.command.CopyCommand.
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-3568
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3568
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Management Server, VMware
>    Affects Versions: 4.2.0
>         Environment: Build from 4.2
>            Reporter: Sangeetha Hariharan
>            Assignee: Venkata Siva Vijayendra Bhamidipati
>            Priority: Blocker
>             Fix For: 4.2.0
>
>         Attachments: management-server.log
>
>
> Parallel deployment - Vmware - When deploying 30 parallel Vms , we see ~15 Vms fail in
org.apache.cloudstack.storage.command.CopyCommand.
> Steps to reproduce the problem:
> Set up - Advanced zone with Vmware ESXI, 5.0.0 host.
> Deploy 30 Vms in parallel in a network that is already implemented (router is already
running).
> Only 9 vms were successfully deployed.
> ~15 Vms failed in org.apache.cloudstack.storage.command.CopyCommand because of "ROOT-106-104/ROOT-106-104.vmdk
was not found" error.
>  mysql> select count(*),state,type from vm_instance where name like "hello%" group
by state,type ;
> +----------+---------+------+
> | count(*) | state   | type |
> +----------+---------+------+
> |       42 | Error   | User |
> |        9 | Running | User |
> +----------+---------+------+
> 2 rows in set (0.00 sec)
> Following exception seen in management server logs:
> 2013-07-16 15:47:55,099 ERROR [storage.resource.VmwareStorageProcessor] (DirectAgent-71:10.223.57.66)
CreateCommand failed due to Exception: java.lang.RuntimeException
> Message: File [7caed722555d3f71879476c5e8d8d5ad] ROOT-106-104/ROOT-106-104.vmdk was not
found
> java.lang.RuntimeException: File [7caed722555d3f71879476c5e8d8d5ad] ROOT-106-104/ROOT-106-104.vmdk
was not found
>         at com.cloud.hypervisor.vmware.util.VmwareClient.waitForTask(VmwareClient.java:290)
>         at com.cloud.hypervisor.vmware.mo.HostDatastoreBrowserMO.searchDatastore(HostDatastoreBrowserMO.java:57)
>         at com.cloud.hypervisor.vmware.mo.HostDatastoreBrowserMO.searchDatastore(HostDatastoreBrowserMO.java:78)
>         at com.cloud.hypervisor.vmware.mo.DatastoreMO.folderExists(DatastoreMO.java:319)
>         at com.cloud.storage.resource.VmwareStorageProcessor.createVMLinkedClone(VmwareStorageProcessor.java:279)
>         at com.cloud.storage.resource.VmwareStorageProcessor.cloneVolumeFromBaseTemplate(VmwareStorageProcessor.java:393)
>         at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:73)
>         at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.handleStorageCommands(StorageSubsystemCommandHandlerBase.java:49)
>         at com.cloud.hypervisor.vmware.resource.VmwareResource.executeRequest(VmwareResource.java:565)
>         at com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:679)
> 2013-07-16 15:47:55,100 DEBUG [agent.manager.DirectAgentAttache] (DirectAgent-71:null)
Seq 1-227737729: Response Received:
> 2013-07-16 15:47:55,100 DEBUG [agent.transport.Request] (DirectAgent-71:null) Seq 1-227737729:
Processing:  { Ans: , MgmtId: 7647994577963, via: 1, Ver: v1, Flags: 10, [{"org.apache.cloudstack.storage.command.CopyCmdAnswer":{"result":false,"details":"java.lang.RuntimeException:
File [7caed722555d3f71879476c5e8d8d5ad] ROOT-106-104/ROOT-106-104.vmdk was not found","wait":0}}]
}
> 2013-07-16 15:47:55,100 DEBUG [agent.transport.Request] (Job-Executor-8:job-101 = [ cad3c4ea-3d40-46f4-bce5-26d60d4af7c1
]) Seq 1-227737729: Received:  { Ans: , MgmtId: 7647994577963, via: 1, Ver: v1, Flags: 10,
{ CopyCmdAnswer } }
> mysql> select count(*),state,type from vm_instance where name like "hello%" group
by state,type ;
> +----------+---------+------+
> | count(*) | state   | type |
> +----------+---------+------+
> |       42 | Error   | User |
> |        9 | Running | User |
> +----------+---------+------+
> 2 rows in set (0.00 sec)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message