cloudstack-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Remi Bergsma <RBerg...@schubergphilis.com>
Subject Re: upgrading 4.5.2 -> 4.6.0 virtualrouter upgrade timeout
Date Mon, 04 Jan 2016 10:44:43 GMT
Hi Andrei,

Missed that mail, sorry. I created a PR that allows for longer timeouts [1].

Also, you can bump the router.aggregation.command.each.timeout global setting to say 15-30
so it will allow to boot.

Next, we need to find why it takes so long in the first place. In our environment it at least
starts now.

Regards,
Remi

[1] https://github.com/apache/cloudstack/pull/1291





On 04/01/16 11:41, "Andrei Mikhailovsky" <andrei@arhont.com> wrote:

>Hello guys,
>
>Tried the user's mailing list without any luck. Perhaps the dev guys know if this issue
is being looked at for the next release?
>
>I've just upgraded to 4.6.2 and have similar issues with three virtual routers out of
22 in total. They are all failing exactly the same way as described here.
>
>Has anyone found a permanent workaround for this issue?
>
>Thanks
>
>Andrei
>
>----- Original Message -----
>> From: "Stephan Seitz" <s.seitz@secretresearchfacility.com>
>> To: "users" <users@cloudstack.apache.org>
>> Sent: Monday, 30 November, 2015 19:53:57
>> Subject: Re: upgrading 4.5.2 -> 4.6.0 virtualrouter upgrade timeout
>
>> Does anybody else experiemce problems due to (very) slow deployment of
>> VRs?
>> 
>> 
>> Am Dienstag, den 24.11.2015, 16:31 +0100 schrieb Stephan Seitz:
>>> Update / FYI:
>>> After faking the particular VRu in sql, I tried to restart that
>>> network,
>>> and it always fails. To me it looks like the update_config.py - which
>>> takes almost all cpu ressources - runs way longer any watchdog will
>>> accept.
>>> 
>>> I'm able to mitigate that by very nasty workarounds:
>>> a) start the router
>>> b) wait until its provisioned
>>> c) restart cloudstack-management
>>> d)  update vm_instance
>>> set state='Running',
>>> power_state='PowerOn' where name = 'r-XXX-VM';
>>> e) once: update domain_router
>>> set template_version="Cloudstack Release 4.6.0 Wed Nov 4 08:22:47 UTC
>>> 2015",
>>> scripts_version="546c9e7ac38e0aa16ecc498899dac8e2"
>>> where id=XXX;
>>> f) wait until update_config.py finishes (for me thats about 15
>>> minutes)
>>> 
>>> Since I expect the need for VR restarts in the future, this behaviour
>>> is
>>> somehow unsatisfying. It needs a lot of errorprone intervention.
>>> 
>>> I'm quite unsure if it's introduced with the update or the particular
>>> VR
>>> just has simply not been restarted after getting configured with lots
>>> of
>>> ips and rules.
>>> 
>>> 
>>> Am Dienstag, den 24.11.2015, 12:29 +0100 schrieb Stephan Seitz:
>>> > Hi List!
>>> > 
>>> > After upgrading from 4.5.2 to 4.6.0 I faced a problem with one
>>> > virtualrouter. This particular VR has about 10 IPs w/ LB and FW
>>> > rules
>>> > defined. During the upgrade process, and after about 4-5 minutes a
>>> > watchdog kicks in and kills the respective VR due to no response.
>>> > 
>>> > So far I didn't find any timeout value in the global settings.
>>> > Temporarily setting network.router.EnableServiceMonitoring to false
>>> > doesn't change the behaviour.
>>> > 
>>> > Any help, how to mitigate that nasty timeout would be really
>>> > appreciated :)
>>> > 
>>> > cheers,
>>> > 
>>> > Stephan
>>> > 
>>> > From within the VR, the logs show
>>> > 
>>> > 2015-11-24 11:24:33,807  CsFile.py search:123 Searching for
>>> > dhcp-range=interface:eth0,set:interface and replacing with
>>> > dhcp-range=interface:eth0,set:interface-eth0,10.10.22.1,static
>>> > 2015-11-24 11:24:33,808  merge.py load:56 Creating data bag type
>>> > guestnetwork
>>> > 2015-11-24 11:24:33,808  CsFile.py search:123 Searching for
>>> > dhcp-option=tag:interface-eth0,15 and replacing with
>>> > dhcp-option=tag:interface-eth0,15,heinlein.cloudservice
>>> > 2015-11-24 11:24:33,808  CsFile.py search:123 Searching for
>>> > dhcp-option=tag:interface-eth0,6 and replacing with
>>> > dhcp-option=tag:interface
>>> > -eth0,6,10.10.22.1,195.10.208.2,91.198.250.2
>>> > 2015-11-24 11:24:33,809  CsFile.py search:123 Searching for
>>> > dhcp-option=tag:interface-eth0,3, and replacing with
>>> > dhcp-option=tag:interface-eth0,3,10.10.22.1
>>> > 2015-11-24 11:24:33,809  CsFile.py search:123 Searching for
>>> > dhcp-option=tag:interface-eth0,1, and replacing with
>>> > dhcp-option=tag:interface-eth0,1,255.255.255.0
>>> > 2015-11-24 11:24:33,810  CsHelper.py execute:160 Executing: service
>>> > dnsmasq restart
>>> > 
>>> > ==> /var/log/messages <==
>>> > Nov 24 11:24:34 r-504-VM shutdown[6752]: shutting down for system
>>> > halt
>>> > 
>>> > Broadcast message from root@r-504-VM (Tue Nov 24 11:24:34 2015):
>>> > 
>>> > The system is going down for system halt NOW!
>>> > Nov 24 11:24:35 r-504-VM KVP: KVP starting; pid is:6844
>>> > 
>>> > ==> /var/log/cloud.log <==
>>> > /opt/cloud/bin/vr_cfg.sh: line 60:  6603
>>> > Killed                  /opt/cloud/bin/update_config.py
>>> > vm_dhcp_entry.json
>>> > 
>>> > ==> /var/log/messages <==
>>> > Nov 24 11:24:35 r-504-VM cloud: VR config: executing
>>> > failed: /opt/cloud/bin/update_config.py vm_dhcp_entry.json
>>> > 
>>> > ==> /var/log/cloud.log <==
>>> > Tue Nov 24 11:24:35 UTC 2015 : VR config: executing
>>> > failed: /opt/cloud/bin/update_config.py vm_dhcp_entry.json
>>> > Connection to 169.254.2.192 closed by remote host.
>>> > Connection to 169.254.2.192 closed.
>>> > 
>>> > 
>>> > the management-server.log shows
>>> > 
>>> > 2015-11-24 12:24:43,015 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
>>> > (Work-Job-Executor-1:ctx-ad9e4658 job-5163/job-5164) Done executing
>>> > com.cloud.vm.VmWorkStart for job-5164
>>> > 2015-11-24 12:24:43,017 INFO  [o.a.c.f.j.i.AsyncJobMonitor]
>>> > (Work-Job-Executor-1:ctx-ad9e4658 job-5163/job-5164) Remove job
>>> > -5164
>>> > from job monitoring
>>> > 2015-11-24 12:24:43,114 ERROR [c.c.a.ApiAsyncJobDispatcher]
>>> > (API-Job-Executor-1:ctx-760da779 job-5163) Unexpected exception
>>> > while
>>> > executing org.apache.cloudstack.api.command.admin.
>>> > router.StartRouterCmd
>>> > com.cloud.exception.AgentUnavailableException: Resource [Host:1] is
>>> > unreachable: Host 1: Unable to start instance due to Unable to
>>> > start
>>> > VM[DomainRouter|r-504-VM] due to error in f
>>> > inalizeStart, not retrying
>>> >         at
>>> > com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMach
>>> > ineManagerImpl.java:1121)
>>> >         at
>>> > com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMach
>>> > ineManagerImpl.java:4580)
>>> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>>> > Method)
>>> >         at
>>> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImp
>>> > l.java:57)
>>> >         at
>>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcc
>>> > essorImpl.java:43)
>>> >         at java.lang.reflect.Method.invoke(Method.java:606)
>>> >         at
>>> > com.cloud.vm.VmWorkJobHandlerProxy.handleVmWorkJob(VmWorkJobHandler
>>> > Proxy.java:107)
>>> >         at
>>> > com.cloud.vm.VirtualMachineManagerImpl.handleVmWorkJob(VirtualMachi
>>> > neManagerImpl.java:4736)
>>> >         at
>>> > com.cloud.vm.VmWorkJobDispatcher.runJob(VmWorkJobDispatcher.java:10
>>> > 2)
>>> >         at
>>> > org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl
>>> > $5.runInContext(AsyncJobManagerImpl.java:537)
>>> >         at
>>> > org.apache.cloudstack.managed.context.ManagedContextRunnable
>>> > $1.run(ManagedContextRunnable.java:49)
>>> >         at
>>> > org.apache.cloudstack.managed.context.impl.DefaultManagedContext
>>> > $1.call(DefaultManagedContext.java:56)
>>> >         at
>>> > org.apache.cloudstack.managed.context.impl.DefaultManagedContext.ca
>>> > llWithContext(DefaultManagedContext.java:103)
>>> >         at
>>> > org.apache.cloudstack.managed.context.impl.DefaultManagedContext.ru
>>> > nWithContext(DefaultManagedContext.java:53)
>>> >         at
>>> > org.apache.cloudstack.managed.context.ManagedContextRunnable.run(Ma
>>> > nagedContextRunnable.java:46)
>>> >         at
>>> > org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl
>>> > $5.run(AsyncJobManagerImpl.java:494)
>>> >         at java.util.concurrent.Executors
>>> > $RunnableAdapter.call(Executors.java:471)
>>> >         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>> >         at
>>> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecuto
>>> > r.java:1145)
>>> >         at java.util.concurrent.ThreadPoolExecutor
>>> > $Worker.run(ThreadPoolExecutor.java:615)
>>> >         at java.lang.Thread.run(Thread.java:745)
>>> > Caused by: com.cloud.utils.exception.ExecutionException: Unable to
>>> > start
>>> > VM[DomainRouter|r-504-VM] due to error in finalizeStart, not
>>> > retrying
>>> >         at
>>> > com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMach
>>> > ineManagerImpl.java:1085)
>>> >         at
>>> > com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMach
>>> > ineManagerImpl.java:4580)
>>> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>>> > Method)
>>> >         ... 18 more
>>> > 2015-11-24 12:24:43,115 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
>>> > (API-Job-Executor-1:ctx-760da779 job-5163) Complete async job-5163,
>>> > jobStatus: FAILED, resultCode: 530, result: org.
>>> > apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[
>>> > ],"errorcode":530,"errortext":"Resource [Host:1] is unreachable:
>>> > Host 1: Unable to start instance due to Unable t
>>> > o start VM[DomainRouter|r-504-VM] due to error in finalizeStart,
>>> > not
>>> > retrying"}
>>> > 
>>> > 
>>> > 
>>> > 
>>> > 
>>> > 
>>>
Mime
View raw message