cloudstack-users mailing list archives

From Rafael Weingärtner <rafaelweingart...@gmail.com>
Subject Re: Timeout with live migration
Date Tue, 13 Oct 2015 23:12:54 GMT
Nice, thanks.
Did that solve your problem? Did you migrate the volume?

On Tue, Oct 13, 2015 at 7:00 PM, Ryan Farrington <rfarrington@remitdata.com>
wrote:

> Issue #1) Terrible parameter names
> https://issues.apache.org/jira/browse/CLOUDSTACK-8946
>
> Issue #2) Wait value for MigrateVolume
> https://issues.apache.org/jira/browse/CLOUDSTACK-8949
>
>
>
> ________________________________________
> From: Rafael Weingärtner [rafaelweingartner@gmail.com]
> Sent: Tuesday, October 13, 2015 3:43 PM
> To: users@cloudstack.apache.org
> Subject: [Questionable]  Re: Timeout with live migration
>
> That is good. Now I can report to you what is in the code.
>
> Let’s start:
> First: when I first looked at the problem, I went straight to the
> class that sends commands to Xen, and there ACS uses a parameter
> called “migratewait” to control the timeout of the command. You tried that,
> and you were still getting the timeout problem.
>
> That happened because, in addition to that timeout, there is another point
> in ACS that controls the timeout of commands that are sent to the hypervisor
> (not just Xen this time), and at that point a parameter called “wait” is
> used as the default value for command timeouts.
>
> First conclusion, we have terrible parameter names ;)
>
> Second, when we create a “MigrateVolumeCommand” we should set a timeout
> value; this way ACS would not use the default value of the parameter “wait”.
> That timeout value should be the same as the one used in CitrixResourceBase
> and its children to control the migration of volumes.
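
To make the interaction between the two settings concrete, here is a minimal Java sketch of the behaviour described in this thread. It is not the ACS source: the class and method names are hypothetical. The fallback to the global "wait" value when a command carries "wait":0, and the doubling of that value, are both assumptions inferred from the log output and the comments in this thread.

```java
// Hypothetical sketch; not CloudStack code. All names are illustrative only.
class WaitTimeoutSketch {

    // Assumption: a command whose own wait is 0 (as in the logged
    // MigrateVolumeCommand, "wait":0) falls back to the global "wait" setting.
    static int resolveWaitSeconds(int commandWaitSeconds, int globalWaitSeconds) {
        return commandWaitSeconds > 0 ? commandWaitSeconds : globalWaitSeconds;
    }

    // Assumption: the effective limit is twice the resolved wait, which would
    // explain wait=3600 producing "timed out after 7200" in the logs.
    static int effectiveTimeoutSeconds(int commandWaitSeconds, int globalWaitSeconds) {
        return 2 * resolveWaitSeconds(commandWaitSeconds, globalWaitSeconds);
    }

    public static void main(String[] args) {
        // Global wait of 3600 with no per-command value.
        System.out.println(effectiveTimeoutSeconds(0, 3600));
        // Global wait raised to 18000, as suggested in the thread.
        System.out.println(effectiveTimeoutSeconds(0, 18000));
    }
}
```

Under these assumptions a global "wait" of 3600 gives an effective limit of 7200 seconds, and 18000 gives 36000 seconds (10 hours). Setting a non-zero wait on the command itself would take the global default out of the picture, which is the fix proposed above.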
>
> Can you report what happened to you in a Jira ticket and add my comments
> there?
> I think next Saturday I can have someone working on that for the next ACS
> release (4.7?), or even 4.6 if the PR gets accepted.
>
> Please send me the jira ticket as soon as you open it.
>
> On Tue, Oct 13, 2015 at 5:33 PM, Ryan Farrington <
> rfarrington@remitdata.com>
> wrote:
>
> > Looks like whatever change I made actually resulted in a change in
> > behavior.  Prior to the change we were seeing a message every hour stating
> > that the job agent was waiting; now we see it waited 2 hours and 8 minutes
> > without a peep before finishing. So making the change to the "wait"
> > parameter is what made the magic happen.
> >
> >
> >
> > 2015-10-13 13:21:48,281 DEBUG [c.c.a.t.Request] (Job-Executor-1:ctx-8e0ebced ctx-f49e7503) Seq 38-1788936343: Sending  { Cmd , MgmtId: 42756806312036, via: 38(xen-nc-bc2b7), Ver: v1, Flags: 100111, [{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":805,"volumePath":"5f990946-d6b5-451e-8e78-2eefc1462253","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}] }
> > 2015-10-13 13:21:48,281 DEBUG [c.c.a.t.Request] (Job-Executor-1:ctx-8e0ebced ctx-f49e7503) Seq 38-1788936343: Executing:  { Cmd , MgmtId: 42756806312036, via: 38(xen-nc-bc2b7), Ver: v1, Flags: 100111, [{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":805,"volumePath":"5f990946-d6b5-451e-8e78-2eefc1462253","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}] }
> > 2015-10-13 13:21:48,282 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgent-430:ctx-ac6d7aeb) Seq 38-1788936343: Executing request
> > 2015-10-13 15:27:13,396 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgent-430:ctx-ac6d7aeb) Seq 38-1788936343: Response Received:
> > 2015-10-13 15:27:13,397 DEBUG [c.c.a.t.Request] (DirectAgent-430:ctx-ac6d7aeb) Seq 38-1788936343: Processing:  { Ans: , MgmtId: 42756806312036, via: 38, Ver: v1, Flags: 110, [{"com.cloud.agent.api.storage.MigrateVolumeAnswer":{"volumePath":"00db15be-3ccd-4648-8928-35ca90924d7c","result":true,"wait":0}}] }
> > 2015-10-13 15:27:13,397 DEBUG [c.c.a.m.AgentAttache] (DirectAgent-430:ctx-ac6d7aeb) Seq 38-1788936343: No more commands found
> > 2015-10-13 15:27:13,397 DEBUG [c.c.a.t.Request] (Job-Executor-1:ctx-8e0ebced ctx-f49e7503) Seq 38-1788936343: Received:  { Ans: , MgmtId: 42756806312036, via: 38, Ver: v1, Flags: 110, { MigrateVolumeAnswer } }
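
The time the migration actually took can be read off the "Executing request" and "Response Received" timestamps in the log above. The following is a small Java sketch of that calculation; the class and method names are hypothetical, and the timestamp pattern is simply the one that appears in these log lines.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for measuring the gap between two management-server log lines.
class LogElapsedSketch {

    // Timestamp layout as it appears in the log excerpt (comma before the milliseconds).
    private static final DateTimeFormatter LOG_TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    static Duration elapsed(String start, String end) {
        return Duration.between(
                LocalDateTime.parse(start, LOG_TIMESTAMP),
                LocalDateTime.parse(end, LOG_TIMESTAMP));
    }

    public static void main(String[] args) {
        // "Executing request" and "Response Received" timestamps from the log.
        Duration d = elapsed("2015-10-13 13:21:48,282", "2015-10-13 15:27:13,396");
        System.out.println(d.toMinutes() + " minutes");
    }
}
```

For these two timestamps the gap is a little over two hours, so the command ran past the earlier 7200-second limit and still completed.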
> >
> >
> > ________________________________________
> > From: Rafael Weingärtner [rafaelweingartner@gmail.com]
> > Sent: Tuesday, October 13, 2015 8:09 AM
> > To: users@cloudstack.apache.org
> > Subject: [Questionable]  Re: Timeout with live migration
> >
> > Let’s wait and see whether anything else is messing with that timeout. Then
> > I will send you the details to put into the Jira ticket.
> >
> > On Tue, Oct 13, 2015 at 10:06 AM, Ryan Farrington <
> > rfarrington@remitdata.com
> > > wrote:
> >
> > > Rafael,
> > >     I am still a bit confused as to what you would like me to place in
> > > the JIRA ticket.  I have adjusted the "wait" parameter and will be able to
> > > test it in about an hour.  But I would think the JIRA ticket should be as
> > > detailed as I can make it, or will you be adding details once I have it
> > > created?
> > >
> > >
> > >
> > >
> > > ________________________________________
> > > From: Rafael Weingärtner [rafaelweingartner@gmail.com]
> > > Sent: Tuesday, October 13, 2015 7:52 AM
> > > To: users@cloudstack.apache.org
> > > Subject: [Questionable]  Re: [Questionable] Re: Timeout with live migration
> > >
> > > I guess so. For some reason that I do not understand, the code is
> > > multiplying the value of that parameter by 2, so something like 18000 should
> > > do the trick.
> > >
> > > On Tue, Oct 13, 2015 at 12:15 AM, Ryan Farrington <
> > > rfarrington@remitdata.com
> > > > wrote:
> > >
> > > > Yes, I can open JIRA tickets. What would you like me to do?
> > > >
> > > > I'll be happy to change the "wait" parameter.  Do I assume it should be
> > > > 1/2 of the value I want it to be?
> > > >
> > > >
> > > >
> > > > ________________________________________
> > > > From: Rafael Weingärtner [rafaelweingartner@gmail.com]
> > > > Sent: Monday, October 12, 2015 10:12 PM
> > > > To: users@cloudstack.apache.org
> > > > Subject: [Questionable]  Re: Timeout with live migration
> > > >
> > > > There is your problem: there are currently two distinct values controlling
> > > > those async jobs.
> > > > Change that value and everything will work for you.
> > > > Can you open a Jira ticket?
> > > >
> > > > On Mon, Oct 12, 2015 at 11:51 PM, Ryan Farrington <
> > > > rfarrington@remitdata.com
> > > > > wrote:
> > > >
> > > > > wait is currently configured to be 3600
> > > > >
> > > > >
> > > > >
> > > > > ________________________________________
> > > > > From: Rafael Weingärtner [rafaelweingartner@gmail.com]
> > > > > Sent: Monday, October 12, 2015 9:46 PM
> > > > > To: users@cloudstack.apache.org
> > > > > Subject: [Questionable]  Re: Timeout with live migration
> > > > >
> > > > > I found something odd.
> > > > > Can you check the parameter called "wait"? What value is it using?
> > > > >
> > > > > On Mon, Oct 12, 2015 at 10:54 PM, Ryan Farrington <
> > > > > rfarrington@remitdata.com
> > > > > > wrote:
> > > > >
> > > > > > Yes, the parameter was set long ago and the management server has been
> > > > > > restarted numerous times over the past few days as we played with other
> > > > > > parameters to no effect.
> > > > > >
> > > > > > After looking at the log a little more, does the "Failed to send command,
> > > > > > due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> > > > > > 996939857 to Host 38 timed out after 7200" mean that the migration start
> > > > > > command is being sent in some kind of synchronous mode and not returning
> > > > > > control back to the job manager?
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > ________________________________________
> > > > > > From: Rafael Weingärtner [rafaelweingartner@gmail.com]
> > > > > > Sent: Monday, October 12, 2015 8:46 PM
> > > > > > To: users@cloudstack.apache.org
> > > > > > Subject: [Questionable]  Re: Timeout with live migration
> > > > > >
> > > > > > I thought you were using the “migrateVirtualMachineWithVolume” command, but
> > > > > > it seems that you are using the “migrateVolume” command from the ACS API.
> > > > > >
> > > > > >
> > > > > > In the code I debugged for “migrateVirtualMachineWithVolume”, the parameter
> > > > > > value 3600 means a timeout of 1 hour.
> > > > > >
> > > > > > For “migrateVolume” it is the same; they both end up in
> > > > > > “com.cloud.hypervisor.xen.resource.XenServer610Resource.execute(MigrateVolumeCommand)”,
> > > > > > and in that method the parameter is the same.
> > > > > >
> > > > > >
> > > > > > If your parameter is set to 36000 (10 hours) I do not see why you are
> > > > > > getting the exception after 2 hours.
> > > > > >
> > > > > > Did you restart the management servers after you changed the parameter?
> > > > > >
> > > > > > On Mon, Oct 12, 2015 at 10:31 PM, Ryan Farrington <
> > > > > > rfarrington@remitdata.com
> > > > > > > wrote:
> > > > > >
> > > > > > > Here is the full log, including the stack for the exception, that we get
> > > > > > > at the 2 hour mark. As for the migratewait, it is set to 36000, which should
> > > > > > > be 10 hours.
> > > > > > >
> > > > > > > 2015-10-12 18:41:20,137 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgent-323:ctx-6d42edd7) Seq 31-1023875267: Executing request
> > > > > > > 2015-10-12 18:41:20,457 DEBUG [c.c.a.m.AgentAttache] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Waiting some more time because this is the current command
> > > > > > > 2015-10-12 18:41:20,457 INFO  [c.c.u.e.CSExceptionErrorCode] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Could not find exception: com.cloud.exception.OperationTimedoutException in error code list for exceptions
> > > > > > > 2015-10-12 18:41:20,465 WARN  [c.c.a.m.AgentAttache] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Timed out on Seq 38-996939857:  { Cmd , MgmtId: 42756806312036, via: 38(xen-nc-bc2b7), Ver: v1, Flags: 100111, [{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":808,"volumePath":"0cd3ec8c-9fa9-4caf-8380-1a85cdfd0958","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}] }
> > > > > > > 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Cancelling.
> > > > > > > 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: No more commands found
> > > > > > > 2015-10-12 18:41:20,465 DEBUG [o.a.c.s.RemoteHostEndPoint] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Failed to send command, due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands 996939857 to Host 38 timed out after 7200
> > > > > > > 2015-10-12 18:41:20,471 DEBUG [o.a.c.s.m.AncientDataMotionStrategy] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) copy failed
> > > > > > > com.cloud.utils.exception.CloudRuntimeException: Failed to send command, due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands 996939857 to Host 38 timed out after 7200
> > > > > > >         at org.apache.cloudstack.storage.RemoteHostEndPoint.sendMessage(RemoteHostEndPoint.java:116)
> > > > > > >         at org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.migrateVolumeToPool(AncientDataMotionStrategy.java:382)
> > > > > > >         at org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.copyAsync(AncientDataMotionStrategy.java:421)
> > > > > > >         at org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:70)
> > > > > > >         at org.apache.cloudstack.storage.volume.VolumeServiceImpl.migrateVolume(VolumeServiceImpl.java:931)
> > > > > > >         at com.cloud.storage.VolumeApiServiceImpl.liveMigrateVolume(VolumeApiServiceImpl.java:1680)
> > > > > > >         at com.cloud.storage.VolumeApiServiceImpl.orchestrateMigrateVolume(VolumeApiServiceImpl.java:1666)
> > > > > > >         at com.cloud.storage.VolumeApiServiceImpl.migrateVolume(VolumeApiServiceImpl.java:1622)
> > > > > > >         at sun.reflect.GeneratedMethodAccessor335.invoke(Unknown Source)
> > > > > > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > > >         at java.lang.reflect.Method.invoke(Method.java:622)
> > > > > > >         at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
> > > > > > >         at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
> > > > > > >         at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
> > > > > > >         at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
> > > > > > >         at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
> > > > > > >         at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
> > > > > > >         at com.sun.proxy.$Proxy196.migrateVolume(Unknown Source)
> > > > > > >         at org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd.execute(MigrateVolumeCmd.java:103)
> > > > > > >         at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:161)
> > > > > > >         at com.cloud.api.ApiAsyncJobDispatcher.runJobInContext(ApiAsyncJobDispatcher.java:109)
> > > > > > >         at com.cloud.api.ApiAsyncJobDispatcher$1.run(ApiAsyncJobDispatcher.java:66)
> > > > > > >         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
> > > > > > >         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
> > > > > > >         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
> > > > > > >         at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:63)
> > > > > > >         at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:509)
> > > > > > >         at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
> > > > > > >         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
> > > > > > >         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
> > > > > > >         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
> > > > > > >         at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
> > > > > > >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> > > > > > >         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> > > > > > >         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> > > > > > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> > > > > > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > > > > > >         at java.lang.Thread.run(Thread.java:701)
> > > > > > > 2015-10-12 18:41:20,479 WARN  [o.a.c.s.d.ObjectInDataStoreManagerImpl] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Unsupported data object (VOLUME, org.apache.cloudstack.storage.datastore.PrimaryDataStoreImpl@4fa7a45f), no need to delete from object in store ref table
> > > > > > > 2015-10-12 18:41:20,479 DEBUG [c.c.s.VolumeApiServiceImpl] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) migrate volume failed:com.cloud.utils.exception.CloudRuntimeException: Failed to send command, due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands 996939857 to Host 38 timed out after 7200
> > > > > > > 2015-10-12 18:41:20,480 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (Job-Executor-63:ctx-f7b6817d) Complete async job-5257, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed to migrate volume"}
> > > > > > > 2015-10-12 18:41:20,486 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (Job-Executor-63:ctx-f7b6817d) Done executing org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd for job-5257
> > > > > > > 2015-10-12 18:41:20,489 INFO  [o.a.c.f.j.i.AsyncJobMonitor] (Job-Executor-63:ctx-f7b6817d) Remove job-5257 from job monitoring
> > > > > > >
> > > > > > > ________________________________________
> > > > > > > From: Rafael Weingärtner [rafaelweingartner@gmail.com]
> > > > > > > Sent: Monday, October 12, 2015 8:24 PM
> > > > > > > To: users@cloudstack.apache.org
> > > > > > > Subject: [Questionable]  Re: [Questionable] Re: Timeout with live migration
> > > > > > >
> > > > > > > Now I understand what you are doing; I am familiar with that concept (live
> > > > > > > migration of a VM within a cluster, with the VHD being moved from one SR to
> > > > > > > another).
> > > > > > >
> > > > > > > I just got confused when I read live migration of volumes (a volume does
> > > > > > > not run by itself, so that is why I asked for some more information).
> > > > > > >
> > > > > > > Looking at the source code, this is the variable used to control the
> > > > > > > timeout:
> > > > > > > "long timeout = (_migratewait) * 1000L;"
> > > > > > >
> > > > > > > The value of "_migratewait" is taken from this parameter:
> > > > > > > value = (String) params.get("migratewait");
> > > > > > > _migratewait = NumbersUtil.parseInt(value, 3600);
> > > > > > >
> > > > > > > Therefore, the name of the parameter to be configured is "migratewait";
> > > > > > > the default value is 3600.
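
The lookup quoted above can be sketched as a self-contained Java example. This is an illustration only, not the ACS code: `parseIntOrDefault` is a hypothetical stand-in for `NumbersUtil.parseInt`, and its fall-back-on-missing-or-invalid behaviour is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of deriving the migration timeout from "migratewait".
class MigrateWaitSketch {

    static final int DEFAULT_MIGRATE_WAIT_SECONDS = 3600;

    // Stand-in for NumbersUtil.parseInt; assumed to return the default when
    // the value is missing or not a valid integer.
    static int parseIntOrDefault(String value, int defaultValue) {
        if (value == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    // Mirrors "long timeout = (_migratewait) * 1000L;": seconds to milliseconds.
    static long timeoutMillis(Map<String, Object> params) {
        String value = (String) params.get("migratewait");
        int migrateWait = parseIntOrDefault(value, DEFAULT_MIGRATE_WAIT_SECONDS);
        return migrateWait * 1000L;
    }

    public static void main(String[] args) {
        Map<String, Object> params = new HashMap<>();
        System.out.println(timeoutMillis(params));   // parameter not set
        params.put("migratewait", "36000");
        System.out.println(timeoutMillis(params));   // parameter set to 10 hours
    }
}
```

With the parameter unset the sketch yields 3,600,000 ms (1 hour); with "36000" it yields 36,000,000 ms (10 hours).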
> > > > > > >
> > > > > > >
> > > > > > > BTW1: I think that is a terrible parameter name. We should refactor that;
> > > > > > > could you open a Jira ticket for that?
> > > > > > >
> > > > > > > BTW2: that error message you posted does not seem to be related to the
> > > > > > > migration timeout; in the code, if the copy times out, the message
> > > > > > > would be:
> > > > > > > "Async " + timeout/1000 + " seconds timeout for task " + task.toString()"
> > > > > > >
> > > > > > > Maybe that is because it throws a "Types.BadAsyncResult(msg)" and that
> > > > > > > might be translated into that message, or it might not be related to the
> > > > > > > problem itself, and you just thought that it was.
> > > > > > >
> > > > > > >
> > > > > > > Does it help you?
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Oct 12, 2015 at 10:00 PM, Ryan Farrington <
> > > > > > > rfarrington@remitdata.com
> > > > > > > > wrote:
> > > > > > >
> > > > > > > > Hypervisor:  XenServer
> > > > > > > >
> > > > > > > > We are moving a data volume from one storage onto another without shutting
> > > > > > > > down the VM, because that would just be silly and a triplication of effort
> > > > > > > > with the whole copying to secondary storage and then back off again. The
> > > > > > > > volume is staying in the same cluster, just moving to a different Primary
> > > > > > > > storage (or SR in the XenServer vernacular).
> > > > > > > >
> > > > > > > > If you are familiar with ESX this is a "Storage VMotion", whereas in
> > > > > > > > XenServer it is called "Storage XenMotion".
> > > > > > > >
> > > > > > > > ________________________________________
> > > > > > > > From: Rafael Weingärtner [rafaelweingartner@gmail.com]
> > > > > > > > Sent: Monday, October 12, 2015 7:53 PM
> > > > > > > > To: users@cloudstack.apache.org
> > > > > > > > Subject: [Questionable]  Re: Timeout with live migration
> > > > > > > >
> > > > > > > > What do you mean by live migrating a data volume?!
> > > > > > > > I understand a live migration of a VM, but volumes...
> > > > > > > >
> > > > > > > > Do you mean live migrating a VM that has a volume attached?
> > > > > > > > Are you migrating that volume to a different cluster, or just to a
> > > > > > > > different storage in the same cluster?
> > > > > > > > What hypervisor are you using?
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Oct 12, 2015 at 9:47 PM, Ryan Farrington <
> > > > > > > > rfarrington@remitdata.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Live migrating a data volume. We are purely on shared storage so no local
> > > > > > > > > storage is involved.
> > > > > > > > >
> > > > > > > > > ________________________________________
> > > > > > > > > From: Rafael Weingärtner [rafaelweingartner@gmail.com]
> > > > > > > > > Sent: Monday, October 12, 2015 7:37 PM
> > > > > > > > > To: users@cloudstack.apache.org
> > > > > > > > > Subject: [Questionable]  Re: Timeout with live migration
> > > > > > > > >
> > > > > > > > > Are you live migrating a VM, or migrating a volume of a stopped VM to a
> > > > > > > > > different primary storage?
> > > > > > > > >
> > > > > > > > > If it is a running VM, is the VM allocated on shared storage or local
> > > > > > > > > storage?
> > > > > > > > >
> > > > > > > > > On Mon, Oct 12, 2015 at 9:17 PM, Ryan Farrington <rfarrington@remitdata.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > The slow transfer is related to the storage we are trying to migrate off
> > > > > > > > > > of.  We are capable of getting about 350mbps off the disks, but when we are
> > > > > > > > > > moving volumes that are greater than about 500GB we end up racing the clock
> > > > > > > > > > and hoping that the migration finishes before the job times out.  It would
> > > > > > > > > > be awesome to be able to manage that timeout. I know there are a ton of
> > > > > > > > > > settings I just don't know about, and I am hoping someone might be able to
> > > > > > > > > > point me in the right direction.
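
The "racing the clock" problem above is simple arithmetic, sketched here in Java. The names are hypothetical, and the units are assumptions: "350mbps" is read as 350 megabits per second and "500GB" as decimal gigabytes. If the figure was meant as megabytes per second, the estimate shrinks by a factor of eight.

```java
// Hypothetical back-of-the-envelope estimate; units are assumptions (see lead-in).
class TransferEstimateSketch {

    // Estimated transfer time in seconds for a volume of the given size
    // (decimal gigabytes) at a sustained rate in megabits per second.
    static double estimateSeconds(double sizeGigabytes, double megabitsPerSecond) {
        double megabits = sizeGigabytes * 1000.0 * 8.0; // GB -> MB -> megabits
        return megabits / megabitsPerSecond;
    }

    // True if the estimated transfer fits inside the given timeout.
    static boolean fitsWithin(double sizeGigabytes, double megabitsPerSecond,
                              long timeoutSeconds) {
        return estimateSeconds(sizeGigabytes, megabitsPerSecond) <= timeoutSeconds;
    }

    public static void main(String[] args) {
        double seconds = estimateSeconds(500, 350);
        System.out.println("estimated hours: " + seconds / 3600.0);
        System.out.println("fits in 7200 s:  " + fitsWithin(500, 350, 7200));
        System.out.println("fits in 36000 s: " + fitsWithin(500, 350, 36000));
    }
}
```

Under those unit assumptions a 500 GB volume needs a bit over three hours, which does not fit in a 7200-second window but fits comfortably in 36000 seconds.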
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ________________________________________
> > > > > > > > > > From: Rafael Weingärtner [rafaelweingartner@gmail.com]
> > > > > > > > > > Sent: Monday, October 12, 2015 6:40 PM
> > > > > > > > > > To: users@cloudstack.apache.org
> > > > > > > > > > Subject: [Questionable]  Re: Timeout with live migration
> > > > > > > > > >
> > > > > > > > > > I would first check your NICs' speed and load and the amount of RAM
> > > > > > > > > > allocated for the migrating VM, and then check the hypervisor log files.
> > > > > > > > > >
> > > > > > > > > > On Mon, Oct 12, 2015 at 8:19 PM, Jan-Arve Nygård <jan.arve.nygard@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > What version are you running? Check if the copy.volume.wait setting is set
> > > > > > > > > > > to 7200 and increase it. If not, you could also check
> > > > > > > > > > > job.cancel.threshold.minutes and job.expire.minutes.
> > > > > > > > > > >
> > > > > > > > > > > -Jan-Arve
> > > > > > > > > > >
> > > > > > > > > > > 2015-10-13 0:46 GMT+02:00 Ryan Farrington <rfarrington@remitdata.com>:
> > > > > > > > > > >
> > > > > > > > > > > > We are experiencing a failure in CloudStack waiting for an async job
> > > > > > > > > > > > performing a live migration of a volume to finish. I've copied the relevant
> > > > > > > > > > > > log entries below. We acknowledge that the migration will take a few hours
> > > > > > > > > > > > based on the volume of the data and we are looking for a way to increase
> > > > > > > > > > > > the timeout of 7200 seconds to something we know we can work with.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > 2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint] (Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to send command, due to Agent:27, com.cloud.exception.OperationTimedoutException: Commands 835325398 to Host 27 timed out after 7200



-- 
Rafael Weingärtner
