cloudstack-dev mailing list archives

From Remi Bergsma <RBerg...@schubergphilis.com>
Subject Re: upgrading 4.5.2 -> 4.6.0 virtualrouter upgrade timeout
Date Tue, 05 Jan 2016 05:49:05 GMT
Hi Andrei,

Did you try it in combination with the patch I created (PR1291)? You need both changes. 
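
In case it's useful, this is roughly how you can pull the PR into a local
build for testing (a sketch, assuming you build from a checkout of the
apache/cloudstack git repo; GitHub exposes every PR under a pull/<id>/head
ref):

  # fetch PR 1291 into a local branch and build with it
  git fetch origin pull/1291/head:pr1291
  git checkout pr1291
  mvn -P developer,systemvm clean install -DskipTests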

Regards, Remi 

Sent from my iPhone

> On 04 Jan 2016, at 22:17, Andrei Mikhailovsky <andrei@arhont.com> wrote:
> 
> Hi Remi,
> 
> Thanks for your reply. However, your suggestion of increasing the
> router.aggregation.command.each.timeout setting didn't help. I've tried
> setting the value to 120, to no avail. It still fails with the same error.
> 
> Andrei
> 
> ----- Original Message -----
>> From: "Remi Bergsma" <RBergsma@schubergphilis.com>
>> To: "dev" <dev@cloudstack.apache.org>
>> Sent: Monday, 4 January, 2016 10:44:43
>> Subject: Re: upgrading 4.5.2 -> 4.6.0 virtualrouter upgrade timeout
> 
>> Hi Andrei,
>> 
>> Missed that mail, sorry. I created a PR that allows for longer timeouts [1].
>> 
>> Also, you can bump the router.aggregation.command.each.timeout global setting
>> to, say, 15-30 so the router is allowed enough time to boot.
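>> 
>> For example (a minimal sketch, assuming direct access to the management
>> server's MySQL database; the value can also be edited in the UI under
>> Global Settings, and a management server restart is typically needed
>> before the new value is picked up):
>> 
>>   mysql -u cloud -p cloud -e "UPDATE configuration SET value='30' \
>>     WHERE name='router.aggregation.command.each.timeout';"
>>   service cloudstack-management restart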
>> 
>> Next, we need to find out why it takes so long in the first place. In our
>> environment the router at least starts now.
>> 
>> Regards,
>> Remi
>> 
>> [1] https://github.com/apache/cloudstack/pull/1291
>> 
>> 
>>> On 04/01/16 11:41, "Andrei Mikhailovsky" <andrei@arhont.com> wrote:
>>> 
>>> Hello guys,
>>> 
>>> I tried the users mailing list without any luck. Perhaps the dev guys know
>>> whether this issue is being looked at for the next release?
>>> 
>>> I've just upgraded to 4.6.2 and have similar issues with three virtual routers
>>> out of 22 in total. They are all failing exactly the same way as described
>>> here.
>>> 
>>> Has anyone found a permanent workaround for this issue?
>>> 
>>> Thanks
>>> 
>>> Andrei
>>> 
>>> ----- Original Message -----
>>>> From: "Stephan Seitz" <s.seitz@secretresearchfacility.com>
>>>> To: "users" <users@cloudstack.apache.org>
>>>> Sent: Monday, 30 November, 2015 19:53:57
>>>> Subject: Re: upgrading 4.5.2 -> 4.6.0 virtualrouter upgrade timeout
>>> 
>>>> Does anybody else experience problems due to (very) slow deployment of
>>>> VRs?
>>>> 
>>>> 
>>>> Am Dienstag, den 24.11.2015, 16:31 +0100 schrieb Stephan Seitz:
>>>>> Update / FYI:
>>>>> After faking the particular VR upgrade in SQL, I tried to restart that
>>>>> network, and it always fails. To me it looks like update_config.py -
>>>>> which takes almost all CPU resources - runs way longer than any
>>>>> watchdog will accept.
>>>>> 
>>>>> I'm able to mitigate that with some very nasty workarounds (see the
>>>>> sketch after this list):
>>>>> a) start the router
>>>>> b) wait until it is provisioned
>>>>> c) restart cloudstack-management
>>>>> d) update vm_instance set state='Running', power_state='PowerOn'
>>>>>    where name='r-XXX-VM';
>>>>> e) once: update domain_router
>>>>>    set template_version="Cloudstack Release 4.6.0 Wed Nov 4 08:22:47 UTC 2015",
>>>>>        scripts_version="546c9e7ac38e0aa16ecc498899dac8e2"
>>>>>    where id=XXX;
>>>>> f) wait until update_config.py finishes (for me that's about 15
>>>>>    minutes)
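>>>>> 
>>>>> A rough sketch of steps d) to f) as shell commands (assuming the
>>>>> default "cloud" MySQL database and user; substitute your router's
>>>>> name and id for the XXX placeholders):
>>>>> 
>>>>>   # step d: mark the router as running again
>>>>>   mysql -u cloud -p cloud -e "UPDATE vm_instance \
>>>>>     SET state='Running', power_state='PowerOn' \
>>>>>     WHERE name='r-XXX-VM';"
>>>>>   # step e: fake the template/scripts version the upgrade check expects
>>>>>   mysql -u cloud -p cloud -e "UPDATE domain_router \
>>>>>     SET template_version='Cloudstack Release 4.6.0 Wed Nov 4 08:22:47 UTC 2015', \
>>>>>         scripts_version='546c9e7ac38e0aa16ecc498899dac8e2' \
>>>>>     WHERE id=XXX;"
>>>>>   # step f (hypothetical polling loop): wait until update_config.py is
>>>>>   # done on the VR, over the same link-local SSH access the management
>>>>>   # server uses
>>>>>   while ssh -p 3922 -i /root/.ssh/id_rsa.cloud root@169.254.x.x \
>>>>>         pgrep -f update_config.py >/dev/null; do sleep 30; done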
>>>>> 
>>>>> Since I expect to need VR restarts in the future, this behaviour is
>>>>> somewhat unsatisfying. It needs a lot of error-prone manual
>>>>> intervention.
>>>>> 
>>>>> I'm quite unsure whether this was introduced by the upgrade, or whether
>>>>> the particular VR simply had never been restarted after getting
>>>>> configured with lots of IPs and rules.
>>>>> 
>>>>> 
>>>>> Am Dienstag, den 24.11.2015, 12:29 +0100 schrieb Stephan Seitz:
>>>>>> Hi List!
>>>>>> 
>>>>>> After upgrading from 4.5.2 to 4.6.0 I faced a problem with one
>>>>>> virtual router. This particular VR has about 10 IPs w/ LB and FW rules
>>>>>> defined. During the upgrade process, after about 4-5 minutes, a
>>>>>> watchdog kicks in and kills the respective VR due to no response.
>>>>>> 
>>>>>> So far I didn't find any timeout value in the global settings.
>>>>>> Temporarily setting network.router.EnableServiceMonitoring to false
>>>>>> doesn't change the behaviour.
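>>>>>> 
>>>>>> One way to hunt for candidate knobs (a sketch, assuming access to the
>>>>>> management server's "cloud" MySQL database) is to grep the
>>>>>> configuration table directly:
>>>>>> 
>>>>>>   mysql -u cloud -p cloud -e "SELECT name, value FROM configuration \
>>>>>>     WHERE name LIKE '%timeout%' OR name LIKE '%wait%';"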
>>>>>> 
>>>>>> Any help on how to mitigate that nasty timeout would be really
>>>>>> appreciated :)
>>>>>> 
>>>>>> cheers,
>>>>>> 
>>>>>> Stephan
>>>>>> 
>>>>>> From within the VR, the logs show
>>>>>> 
>>>>>> 2015-11-24 11:24:33,807  CsFile.py search:123 Searching for
>>>>>> dhcp-range=interface:eth0,set:interface and replacing with
>>>>>> dhcp-range=interface:eth0,set:interface-eth0,10.10.22.1,static
>>>>>> 2015-11-24 11:24:33,808  merge.py load:56 Creating data bag type
>>>>>> guestnetwork
>>>>>> 2015-11-24 11:24:33,808  CsFile.py search:123 Searching for
>>>>>> dhcp-option=tag:interface-eth0,15 and replacing with
>>>>>> dhcp-option=tag:interface-eth0,15,heinlein.cloudservice
>>>>>> 2015-11-24 11:24:33,808  CsFile.py search:123 Searching for
>>>>>> dhcp-option=tag:interface-eth0,6 and replacing with
>>>>>> dhcp-option=tag:interface-eth0,6,10.10.22.1,195.10.208.2,91.198.250.2
>>>>>> 2015-11-24 11:24:33,809  CsFile.py search:123 Searching for
>>>>>> dhcp-option=tag:interface-eth0,3, and replacing with
>>>>>> dhcp-option=tag:interface-eth0,3,10.10.22.1
>>>>>> 2015-11-24 11:24:33,809  CsFile.py search:123 Searching for
>>>>>> dhcp-option=tag:interface-eth0,1, and replacing with
>>>>>> dhcp-option=tag:interface-eth0,1,255.255.255.0
>>>>>> 2015-11-24 11:24:33,810  CsHelper.py execute:160 Executing: service
>>>>>> dnsmasq restart
>>>>>> 
>>>>>> ==> /var/log/messages <==
>>>>>> Nov 24 11:24:34 r-504-VM shutdown[6752]: shutting down for system halt
>>>>>> 
>>>>>> Broadcast message from root@r-504-VM (Tue Nov 24 11:24:34 2015):
>>>>>> 
>>>>>> The system is going down for system halt NOW!
>>>>>> Nov 24 11:24:35 r-504-VM KVP: KVP starting; pid is:6844
>>>>>> 
>>>>>> ==> /var/log/cloud.log <==
>>>>>> /opt/cloud/bin/vr_cfg.sh: line 60:  6603 Killed                  /opt/cloud/bin/update_config.py vm_dhcp_entry.json
>>>>>> 
>>>>>> ==> /var/log/messages <==
>>>>>> Nov 24 11:24:35 r-504-VM cloud: VR config: executing failed: /opt/cloud/bin/update_config.py vm_dhcp_entry.json
>>>>>> 
>>>>>> ==> /var/log/cloud.log <==
>>>>>> Tue Nov 24 11:24:35 UTC 2015 : VR config: executing failed: /opt/cloud/bin/update_config.py vm_dhcp_entry.json
>>>>>> Connection to 169.254.2.192 closed by remote host.
>>>>>> Connection to 169.254.2.192 closed.
>>>>>> 
>>>>>> 
>>>>>> the management-server.log shows
>>>>>> 
>>>>>> 2015-11-24 12:24:43,015 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (Work-Job-Executor-1:ctx-ad9e4658 job-5163/job-5164) Done executing com.cloud.vm.VmWorkStart for job-5164
>>>>>> 2015-11-24 12:24:43,017 INFO  [o.a.c.f.j.i.AsyncJobMonitor] (Work-Job-Executor-1:ctx-ad9e4658 job-5163/job-5164) Remove job-5164 from job monitoring
>>>>>> 2015-11-24 12:24:43,114 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-1:ctx-760da779 job-5163) Unexpected exception while executing org.apache.cloudstack.api.command.admin.router.StartRouterCmd
>>>>>> com.cloud.exception.AgentUnavailableException: Resource [Host:1] is unreachable: Host 1: Unable to start instance due to Unable to start VM[DomainRouter|r-504-VM] due to error in finalizeStart, not retrying
>>>>>>        at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:1121)
>>>>>>        at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:4580)
>>>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>        at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>        at com.cloud.vm.VmWorkJobHandlerProxy.handleVmWorkJob(VmWorkJobHandlerProxy.java:107)
>>>>>>        at com.cloud.vm.VirtualMachineManagerImpl.handleVmWorkJob(VirtualMachineManagerImpl.java:4736)
>>>>>>        at com.cloud.vm.VmWorkJobDispatcher.runJob(VmWorkJobDispatcher.java:102)
>>>>>>        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:537)
>>>>>>        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>>>>>>        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>>>>>>        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>>>>>>        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>>>>>>        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>>>>>>        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:494)
>>>>>>        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>>        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>        at java.lang.Thread.run(Thread.java:745)
>>>>>> Caused by: com.cloud.utils.exception.ExecutionException: Unable to start VM[DomainRouter|r-504-VM] due to error in finalizeStart, not retrying
>>>>>>        at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:1085)
>>>>>>        at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:4580)
>>>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>        ... 18 more
>>>>>> 2015-11-24 12:24:43,115 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-1:ctx-760da779 job-5163) Complete async job-5163, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Resource [Host:1] is unreachable: Host 1: Unable to start instance due to Unable to start VM[DomainRouter|r-504-VM] due to error in finalizeStart, not retrying"}
>>>>>> 
