cloudstack-users mailing list archives

From Marc-Andre Jutras <mar...@marcuspocus.com>
Subject Re: cs 4.5.1, hosts stuck in disconnected status
Date Fri, 22 Jul 2016 15:40:02 GMT
Hey !! Answers inline through your message...



On 2016-07-22 10:32 AM, Scheurer François wrote:
> Hi Marcus
>
>
>
>
>
> Many thanks for your answer.
>
>
>
>
>
>> Did you have any load balancer in front of your 3 CSMAN servers? if so, is there
any persistence defined in your configuration ? Can you try to set it to SourceIP and fix
the timeout to something like 60 or 120 min ?
> Yes, we have an haproxy with balance source. But the timeouts are only 5 min; I will extend
them to 60 min as you proposed.
>

Great !! That should stabilize your SSVM // CVM connectivity...
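
For reference, something like this in haproxy.cfg is what I have in mind
( just a sketch; the listen name, server names and IPs are placeholders, adapt to your setup ):

     listen cloudstack-agent-8250
         bind *:8250
         mode tcp
         balance source
         timeout client 60m
         timeout server 60m
         server csman01 192.168.0.11:8250 check
         server csman02 192.168.0.12:8250 check
         server csman03 192.168.0.13:8250 check

( do the same for 8080 if you balance the UI / API through the same haproxy )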

>
>
>
>> under global settings / host, make sure your Xen hosts, VM or System VM can reach
the IP defined there...
> Yes, the domain from the global parameters is reachable from the Xen hosts and System VM's on
tcp 8250 (ping is also ok).
>
>
>
>
>
>> iptables : make sure these tcp port are open on each of your CSMAN servers... : 8080,
8096, 8250, 9090 ( and also validate that you got these ports open on your Load balancer too...
)
> Yes all 4 ports are opened in the CSMAN iptables.
>
> But in the LB we opened only 8080 (for UI/CS API) and 8250 (privately for System VM's).
>
> I thought 8096 and 9090 are only needed between the CSMAN's (for unauthenticated API
calls from scripts and for pings)

Correct... 8096 is required for the API and 9090 for HA between all CSMAN 
servers. If you're not exposing the API, you don't need to map 8096... 
I personally also set 9090, just to have another point of monitoring 
available, but it's up to you whether to add this to your haproxy config 
or not... ( not required )
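
On the CSMAN side, something like this should be enough ( just a sketch, CentOS-style
iptables; restrict the source networks to your management ranges ):

     iptables -I INPUT -p tcp --dport 8080 -j ACCEPT
     iptables -I INPUT -p tcp --dport 8096 -j ACCEPT
     iptables -I INPUT -p tcp --dport 8250 -j ACCEPT
     iptables -I INPUT -p tcp --dport 9090 -j ACCEPT
     service iptables save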

>
>
>
>
>
>> if your zone is set to Advanced mode, make sure each of your xenserver is running
openvswitch ( xe-switch-network-backend openvswitch ) if not, ( basic mode ) set it to bridge...
( xe-switch-network-backend bridge ) ( more info:
> http://docs.cloudstack.apache.org/projects/cloudstack-installation/en/4.6/hypervisor/xenserver.html#install-cloudstack-xenserver-support-package-csp
>
> )
>
> Yes this is fine.
>
>
>
>
>
>> check also each iptables definition in each of your xen server, to test, flush all
tables and check if Cloudstack can connect correctly to it...
> ( iptables -F      iptables definition in : /etc/sysconfig/iptables )
>
>> you can also try to delete one xenhost and re-add it to cloudstack and check in the
CS logs if you're seeing some files copied to the host...
> Is it possible to delete and re-add a xenhost with VM's running on it? Or do we need
to evacuate them first?

You need to move the VMs somewhere else before deleting a server... put the 
server in maintenance mode: this should force all VMs to be migrated to 
other hypervisors...
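
If you prefer to do it by hand on the XenServer side, the rough equivalent is
( hostname is a placeholder ):

     xe host-disable host=<hostname>
     xe host-evacuate host=<hostname>     # live-migrates all resident VMs away

but going through CloudStack's maintenance mode is cleaner, since it keeps the 
CloudStack database in sync with what the hypervisor is doing.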

>
> We also found that solution in older messages from the mailing list. But we finally got
the xen hosts reconnected by simply stopping all CSMANs and restarting a single CSMAN.
>
> After all Xen Hosts got connected we could start the other CSMANs.
>
>
>
> As I wrote in my previous message, we suspect that the issue was caused by entries in
the op_host_transfer table.
>
> It seems that if a Xen Host gets transferred from one CSMAN to another (rebalance) and
if the target CSMAN gets stopped before completing the transfer, then these table entries stay
forever in the DB and the CSMANs never try again to reconnect those Xen Hosts.
>
> This is just speculation; maybe you can confirm it.


It will probably stay forever as disconnected or in error state in the 
db, mainly because your CSMAN server cannot re-communicate with each 
Xenhost...

Is there any firewall or router between your CSMAN and your Xenserver, 
or an IDS / IPS that can play with the tcp packets ?
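
A quick way to spot that kind of interference is to watch for TCP resets while a 
host is trying to reconnect, something like ( interface and IP are placeholders ):

     tcpdump -ni eth0 'host <xenserver-ip> and tcp[tcpflags] & tcp-rst != 0'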

This error message rings a bell :

IOException Broken pipe when sending data to peer 345049098498, close peer connection and
let it re-open


Broken pipe : just as if something external to Cloudstack or Xen had 
reset the communication between your CSMAN server and your Xenserver...

And a quick question: does your Xenserver have a static IP ( no dhcp ? : dhcp 
could generate a communication reset... )

Can you also try to install the latest service pack and latest hotfixes 
on your Xenserver ?
( http://xenserver.org/open-source-virtualization-download.html )
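
To see what is already applied on a host, a quick check could be ( sketch ):

     xe patch-list params=name-label,hosts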

>
>
>
> The main log entries to support this explanation are:
>
> ===>11 = vh006, 345049098122 = man03, vh006 is transferred to man03:
>
>    2016-07-18 14:03:28,744 DEBUG [c.c.a.m.ClusteredAgentAttache] (StatsCollector-1:ctx-814f1ae1)
Seq 11-5143110774457106438: Forwarding null to 345049098122
>
>    2016-07-18 14:03:28,838 DEBUG [c.c.a.t.Request] (StatsCollector-1:ctx-814f1ae1) Seq
11-5143110774457106438: Received: { Ans: , MgmtId: 345049103441, via: 11, Ver: v1, Flags:
10, { GetHostStatsAnswer } }
>
> ===>19 = vh010, 345049098498 = man01, vh010 is transferred to man01, but man01 is stopping
and starting at 14:02:47, so the transfer failed:
>
> ! 2016-07-18 14:03:28,851 DEBUG [c.c.a.m.ClusteredAgentAttache] (StatsCollector-1:ctx-814f1ae1)
Seq 19-2009731333714083845: Forwarding null to 345049098498
>
>    2016-07-18 14:03:28,852 DEBUG [c.c.a.m.ClusteredAgentAttache] (StatsCollector-1:ctx-814f1ae1)
Seq 19-2009731333714083845: Error on connecting to management node: null try = 1
>
>    2016-07-18 14:03:28,852 INFO [c.c.a.m.ClusteredAgentAttache] (StatsCollector-1:ctx-814f1ae1)
IOException Broken pipe when sending data to peer 345049098498, close peer connection and
let it re-open
>
>    2016-07-18 14:03:28,856 WARN [c.c.a.m.AgentManagerImpl] (StatsCollector-1:ctx-814f1ae1)
Exception while sending java.lang.NullPointerExceptio
>
>
>
> See more details in my previous post.
>
>
>
>
>
> I have another question: the cloudstack documentation says that the tcp port 8250 is
used for system vm’s (console proxy & secondary storage) to connect to the CSMAN’s.
>
> Is it true that the Xen Hosts do not use this port?

True: 8250 is only used by the SSVM, CVM and VR / VPC.
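
If you want to double-check from the SSVM itself, there is a check script shipped 
inside it; from the XenServer hosting the SSVM, something along these lines 
( the link-local IP is a placeholder, 3922 and the key path are the CloudStack defaults ):

     ssh -i /root/.ssh/id_rsa.cloud -p 3922 root@<ssvm-link-local-ip> \
         /usr/local/cloud/systemvm/ssvm-check.sh

it should test DNS, the secondary storage mount and the 8250 connection to the 
management server.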

>
> AFAIK the Xen Hosts only get connections from the CSMAN’s (tcp 22/80/443) but never
initiate connections to them. Is that correct?

Correct, it's Cloudstack that will initiate the connection to the 
Xenserver...
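
So a simple sanity check is to confirm each CSMAN can open those ports towards every 
XenServer, e.g. ( the IP is a placeholder ):

     for p in 22 80 443; do nc -zv <xenserver-ip> $p; done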

>
>
>
>
>
> Many Thanks to all contributors!
>
> It is really amazing to see such good and responsive support from a free mailing list.
>
>
>
> Best Regards
>
> Francois Scheurer
>
>
>
>
>
>
>
> -----Original Message-----
> From: Marc-Andre Jutras [mailto:marcus@marcuspocus.com]
> Sent: Thursday, July 21, 2016 8:10 PM
> To: users@cloudstack.apache.org
> Subject: Re: cs 4.5.1, hosts stuck in disconnected status
>
>
>
> Hey Francois,
>
>
>
> here are some suggestions...
>
>
>
> Did you have any load balancer in front of your 3 CSMAN servers? if so, is there any
persistence defined in your configuration ? Can you try to set it to SourceIP and fix the
timeout to something like 60 or 120 min ?
>
>
>
> Also validate these points:
>
>
>
> under global settings / host, make sure your Xen hosts, VM or System VM can reach the
IP defined there...
>
>
>
> iptables : make sure these tcp port are open on each of your CSMAN servers... : 8080,
8096, 8250, 9090 ( and also validate that you got these ports open on your Load balancer too...
)
>
>
>
> if your zone is set to Advanced mode, make sure each of your xenserver is running openvswitch
( xe-switch-network-backend openvswitch ) if not, ( basic mode ) set it to bridge... ( xe-switch-network-backend
bridge ) ( more info:
>
> http://docs.cloudstack.apache.org/projects/cloudstack-installation/en/4.6/hypervisor/xenserver.html#install-cloudstack-xenserver-support-package-csp
>
> )
>
>
>
> check also each iptables definition in each of your xen server, to test, flush all tables
and check if Cloudstack can connect correctly to it...
>
> ( iptables -F      iptables definition in : /etc/sysconfig/iptables )
>
>
>
> you can also try to delete one xenhost and re-add it to cloudstack and check in the CS
logs if you're seeing some files copied to the host...
>
>
>
> try that and keep us posted !
>
>
>
> Marcus
>
>
>
>
>
> On 2016-07-21 10:50 AM, Scheurer François wrote:
>
>> Dear Stephan and Dag,
>> we also thought about it and checked it but the host was already enabled on xen.
>> Best Regards
>> Francois
>> EveryWare AG
>> François Scheurer
>> Senior Systems Engineer
>> -----Original Message-----
>> From: Dag Sonstebo [mailto:Dag.Sonstebo@shapeblue.com]
>> Sent: Thursday, July 21, 2016 1:23 PM
>> To: users@cloudstack.apache.org
>> Subject: Re: cs 4.5.1, hosts stuck in disconnected status
>> Hi Francois,
>> As pointed out by Stephan the problem is probably with your Xen cluster rather than
your CloudStack management. On the disconnected host you may want to carry out a xe-toolstack-restart
- this will restart Xapi without affecting running VMs. After that check your cluster with
‘xe host-list’ etc. If this doesn’t help you may have to consider restarting the host.
>> Regards,
>> Dag Sonstebo
>> Cloud Architect
>> ShapeBlue
>> On 21/07/2016, 11:25, "Francois Scheurer" <francois.scheurer@everyware.ch> wrote:
>>> Dear CS contributors
>>> We could fix the issue without having to restart the disconnected Xen Hosts.
>>> We suspect that the root cause was an interrupted agent transfer,
>>> during a restart of a Management Server (CSMAN).
>>> We have 3 CSMAN's running in cluster: man01, man02 and man03.
>>> The disconnected vh010 belongs to one Xen Hosts Cluster with 4 nodes:
>>> vh009, vh010, vh011 and vh012.
>>> See the chronological events from the logs with our comments
>>> regarding the disconnection of vh010:
>>> ===>vh010 (host 19) was on agent 345049103441 (man02)
>>>       vh010: Last Disconnected   2016-07-18T14:03:50+0200
>>>       345049098498 = man01
>>>       345049103441 = man02
>>>       345049098122 = man03
>>>       ewcstack-man02-prod:
>>>           2016-07-18T14:00:34.878973+02:00 ewcstack-man02-prod [audit
>>> root/10467 as root/10467 on
>>> pts/1/192.168.252.77:36251->192.168.225.72:22] /root: service
>>> cloudstack-management restart; service cloudstack-usage restart
>>>       ewcstack-man02-prod:
>>>           2016-07-18 14:02:15,797 DEBUG [c.c.s.StorageManagerImpl]
>>> (StorageManager-Scavenger-1:ctx-ea98efd4) Storage pool garbage
>>> collector found 0 templates to clean up in storage pool:
>>> ewcstack-vh010-prod Local Storage
>>>       !    2016-07-18 14:02:26,699 DEBUG
>>> [c.c.a.m.ClusteredAgentManagerImpl] (StatsCollector-1:ctx-7da7a491)
>>> Host
>>> 19 has switched to another management server, need to update agent
>>> map with a forwarding agent attache
>>>       ewcstack-man01-prod:
>>>           2016-07-18T14:02:47.317644+02:00 ewcstack-man01-prod [audit
>>> root/11094 as root/11094 on
>>> pts/0/192.168.252.77:40654->192.168.225.71:22] /root: service
>>> cloudstack-management restart; service cloudstack-usage restart;
>>>       ewcstack-man02-prod:
>>>           2016-07-18 14:03:24,859 DEBUG [c.c.s.StorageManagerImpl]
>>> (StorageManager-Scavenger-1:ctx-c39aaa53) Storage pool garbage
>>> collector found 0 templates to clean up in storage pool:
>>> ewcstack-vh010-prod Local Storage
>>>       ewcstack-man02-prod:
>>>           2016-07-18 14:03:26,260 DEBUG [c.c.a.m.AgentManagerImpl]
>>> (AgentManager-Handler-6:null) SeqA 256-29401: Sending Seq 256-29401:
>>> {
>>> Ans: , MgmtId: 345049103441, via: 256, Ver: v1, Flags: 100010,
>>> [{"com.cloud.agent.api.AgentControlAnswer":{"result":true,"wait":0}}] }
>>>           2016-07-18 14:03:28,535 DEBUG [c.c.s.StatsCollector]
>>> (StatsCollector-1:ctx-814f1ae1) HostStatsCollector is running...
>>>           2016-07-18 14:03:28,553 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 7-6771162039751540742: Forwarding
>>> null to 345049098122
>>>           2016-07-18 14:03:28,661 DEBUG [c.c.a.m.AgentManagerImpl]
>>> (AgentManager-Handler-7:null) SeqA 244-153489: Processing Seq
>>> 244-153489:  { Cmd , MgmtId: -1, via: 244, Ver: v1, Flags: 11,
>>> [{"com.cloud.agent.api.ConsoleProxyLoadReportCommand":{"_proxyVmId":1
>>> 456,"_loadInfo":"{\n
>>> \"connections\": []\n}","wait":0}}] }
>>>           2016-07-18 14:03:28,667 DEBUG [c.c.a.m.AgentManagerImpl]
>>> (AgentManager-Handler-7:null) SeqA 244-153489: Sending Seq 244-153489:
>>> { Ans: , MgmtId: 345049103441, via: 244, Ver: v1, Flags: 100010,
>>> [{"com.cloud.agent.api.AgentControlAnswer":{"result":true,"wait":0}}] }
>>>           2016-07-18 14:03:28,731 DEBUG [c.c.a.t.Request]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 7-6771162039751540742: Received:
>>> {
>>> Ans: , MgmtId: 345049103441, via: 7, Ver: v1, Flags: 10, {
>>> GetHostStatsAnswer } }
>>> ===>11 = vh006, 345049098122 = man03, vh006 is transfered to man03:
>>>           2016-07-18 14:03:28,744 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 11-5143110774457106438:
>>> Forwarding null to 345049098122
>>>           2016-07-18 14:03:28,838 DEBUG [c.c.a.t.Request]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 11-5143110774457106438: Received:
>>> {
>>> Ans: , MgmtId: 345049103441, via: 11, Ver: v1, Flags: 10, {
>>> GetHostStatsAnswer } }
>>> ===>19 = vh010, 345049098498 = man01, vh010 is transfered to man01,
>>> but
>>> man01 is stopping and starting at 14:02:47, so the transfer failed:
>>>       !    2016-07-18 14:03:28,851 DEBUG [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 19-2009731333714083845:
>>> Forwarding null to 345049098498
>>>           2016-07-18 14:03:28,852 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 19-2009731333714083845: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:28,852 INFO [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) IOException Broken pipe when sending
>>> data to peer 345049098498, close peer connection and let it re-open
>>>           2016-07-18 14:03:28,856 WARN  [c.c.a.m.AgentManagerImpl]
>>> (StatsCollector-1:ctx-814f1ae1) Exception while sending
>>>           java.lang.NullPointerException
>>>                   at
>>> com.cloud.agent.manager.ClusteredAgentManagerImpl.connectToPeer(ClusteredAgentManagerImpl.java:527)
>>>                   at
>>> com.cloud.agent.manager.ClusteredAgentAttache.send(ClusteredAgentAttache.java:177)
>>>                   at
>>> com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:395)
>>>                   at
>>> com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:433)
>>>                   at
>>> com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:362)
>>>                   at
>>> com.cloud.agent.manager.AgentManagerImpl.easySend(AgentManagerImpl.java:919)
>>>                   at
>>> com.cloud.resource.ResourceManagerImpl.getHostStatistics(ResourceManagerImpl.java:2460)
>>>                   at
>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>>> Method)
>>>                   at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>                   at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>                   at java.lang.reflect.Method.invoke(Method.java:606)
>>>                   at
>>> org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
>>>                   at
>>> org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
>>>                   at
>>> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
>>>                   at
>>> org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
>>>                   at
>>> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
>>>                   at
>>> org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
>>>                   at com.sun.proxy.$Proxy149.getHostStatistics(Unknown
>>> Source)
>>>                   at
>>> com.cloud.server.StatsCollector$HostCollector.runInContext(StatsCollector.java:325)
>>>                   at
>>> org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>>>                   at
>>> org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>>>                   at
>>> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>>>                   at
>>> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>>>                   at
>>> org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>>>                   at
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>                   at
>>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>>                   at
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>>                   at
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>                   at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>                   at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>                   at java.lang.Thread.run(Thread.java:745)
>>>           2016-07-18 14:03:28,857 WARN  [c.c.r.ResourceManagerImpl]
>>> (StatsCollector-1:ctx-814f1ae1) Unable to obtain host 19 statistics.
>>>           2016-07-18 14:03:28,857 WARN  [c.c.s.StatsCollector]
>>> (StatsCollector-1:ctx-814f1ae1) Received invalid host stats for host:
>>> 19
>>>           2016-07-18 14:03:28,870 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 21-6297439653947506693: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:28,887 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 25-2894407185515675660: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:28,903 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 29-4279264070932103175: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:28,919 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 33-123567514775977989: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:29,057 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 224-4524428775647084550: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:29,170 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 19-2009731333714083846: Error on
>>> connecting to management node: null try = 1
>>> ===>vh010 is invalid and stays disconnected:
>>>       !    2016-07-18 14:03:29,174 WARN  [c.c.r.ResourceManagerImpl]
>>> (StatsCollector-1:ctx-814f1ae1) Unable to obtain GPU stats for host
>>> ewcstack-vh010-prod
>>>           2016-07-18 14:03:29,183 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 21-6297439653947506694: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:29,196 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 25-2894407185515675661: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:29,212 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 29-4279264070932103176: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:29,226 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 33-123567514775977990: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:29,282 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-1:ctx-814f1ae1) Seq 224-4524428775647084551: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:30,246 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-2:ctx-942dd66c) Seq 19-2009731333714083847: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:30,302 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-2:ctx-942dd66c) Seq 21-6297439653947506695: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:30,352 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-2:ctx-942dd66c) Seq 25-2894407185515675662: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:30,381 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-2:ctx-942dd66c) Seq 29-4279264070932103177: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:30,421 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-2:ctx-942dd66c) Seq 33-123567514775977991: Error on
>>> connecting to management node: null try = 1
>>>           2016-07-18 14:03:30,691 DEBUG
>>> [c.c.a.m.ClusteredAgentAttache]
>>> (StatsCollector-2:ctx-942dd66c) Seq 224-4524428775647084552: Error on
>>> connecting to management node: null try = 1
>>> The table op_host_transfer shows 3 transfers that were not completed,
>>> for id 3, 15, 19 = vh007, vh011, vh010:
>>>       mysql> select * from op_host_transfer ;
>>> +-----+------------------------+-----------------------+-------------------+---------------------+
>>>       | id  | initial_mgmt_server_id | future_mgmt_server_id |
>>> state             | created             |
>>> +-----+------------------------+-----------------------+-------------------+---------------------+
>>>       |   3 |           345049103441 |          345049098122 |
>>> TransferRequested | 2016-07-13 14:46:57 |
>>>       |  15 |           345049103441 |          345049098122 |
>>> TransferRequested | 2016-07-14 16:15:11 |
>>>       |  19 |           345049098498 |          345049103441 |
>>> TransferRequested | 2016-07-18 12:03:39 |
>>>       | 130 |           345049103441 |          345049098498 |
>>> TransferRequested | 2016-07-13 14:52:00 |
>>>       | 134 |           345049103441 |          345049098498 |
>>> TransferRequested | 2016-07-03 08:54:40 |
>>>       | 150 |           345049103441 |          345049098498 |
>>> TransferRequested | 2016-07-13 14:52:00 |
>>>       | 158 |           345049103441 |          345049098498 |
>>> TransferRequested | 2016-07-03 08:54:41 |
>>>       | 221 |           345049103441 |          345049098498 |
>>> TransferRequested | 2016-07-13 14:52:00 |
>>>       | 232 |           345049103441 |          345049098498 |
>>> TransferRequested | 2016-07-03 08:54:41 |
>>>       | 244 |           345049103441 |          345049098498 |
>>> TransferRequested | 2016-07-13 14:52:00 |
>>>       | 248 |           345049103441 |          345049098498 |
>>> TransferRequested | 2016-07-03 08:54:41 |
>>>       | 250 |           345049098122 |          345049103441 |
>>> TransferRequested | 2016-07-15 18:54:35 |
>>>       | 251 |           345049103441 |          345049098122 |
>>> TransferRequested | 2016-07-16 09:06:12 |
>>>       | 252 |           345049103441 |          345049098122 |
>>> TransferRequested | 2016-07-18 11:22:06 |
>>>       | 253 |           345049103441 |          345049098122 |
>>> TransferRequested | 2016-07-16 09:06:13 |
>>>       | 254 |           345049103441 |          345049098122 |
>>> TransferRequested | 2016-07-18 11:22:07 |
>>>       | 255 |           345049098122 |          345049098498 |
>>> TransferRequested | 2016-07-18 12:05:40 |
>>> +-----+------------------------+-----------------------+-------------------+---------------------+
>>> Analysis:
>>> A rolling restart of all 3 CSMANs (one-by-one) seems to have caused
>>> these 3 uncompleted transfers and seems to be the cause of the hosts
>>> stuck in Disconnected status.
>>> If we stop all CSMANs and start a single one (for ex. man03), then
>>> these 3 uncompleted transfers disappear and the hosts get connected
>>> automatically.
>>> It is probably also possible to delete them manually in the
>>> op_host_transfer table. (can you confirm this?)
>>> We also discovered an issue with loopback devices that are not
>>> removed after a stop of the CSMAN.
>>> Conclusion:
>>> Problem: xen hosts get and stay forever disconnected.
>>> Solution:
>>>       stop all CSMAN
>>>           losetup -a
>>>           losetup -d /dev/loop{0..7}
>>>           mysql> update host set
>>> status="Up",resource_state="Enabled",mgmt_server_id=<CSMAN-ID> where
>>> id=<HOST-ID>;
>>>           mysql> update op_host_capacity set capacity_state="Enabled"
>>> where host_id=<HOST-ID>;
>>>           mysql> delete from op_host_transfer where id=<HOST-ID>;
>>>       optional:
>>>           on xen server host:
>>>               xe-toolstack-restart; sleep 60
>>>               xe host-list params=enabled
>>>               xe host-enable host=<hostname>
>>>       start a single CSMAN
>>>       restart all System VM's (Secondary Storage and Console Proxy)
>>>       wait until all hosts are connected
>>>       start all other CSMAN's
>>> Useful:
>>>       mysql> select id,name,uuid,status,type, mgmt_server_id from host
>>> where removed is NULL;
>>>       mysql> select * from mshost;
>>>       mysql> select * from op_host_transfer;
>>>       mysql> select * from mshost where removed is NULL;
>>>       mysql> select * from host_tags;
>>>       mysql> select * from mshost_peer;
>>>       mysql> select * from op_host_capacity order by host_id;
>>> Best regards
>>> Francois Scheurer
>>> On 21.07.2016 11:56, Francois Scheurer wrote:
>>>> Dear CS contributors
>>>> We use CS 4.5.1 on 3 clusters with XenServer 6.5.
>>>> One Host in a cluster (and another in another cluster as well) got
>>>> and stayed in status "Disconnected".
>>>> We tried to unmanage/remanage the cluster to force a reconnection,
>>>> we also destroyed all System VM's (virtual console and secondary
>>>> storage VM's), we restarted all management servers.
>>>> We verified on the xen server that it is not disabled, we restarted
>>>> the xen toolstack.
>>>> We also updated the host table to put a mgmt_server_id: update host
>>>> set
>>>> status="Up",resource_state="Disabled",mgmt_server_id="345049103441"
>>>> where id=15;
>>>> Then we restarted the management servers again and also the System VM's.
>>>> We finally updated the table again, this time without mgmt_server_id: update host
>>>> set status="Alert",resource_state="Disabled",mgmt_server_id=NULL
>>>> where id=15; Then we restarted the management servers again and also
>>>> the System VM's.
>>>> Nothing helps, the server does not reconnect.
>>>> Calling ForceReconnect shows this error:
>>>> 2016-07-18 11:26:07,418 DEBUG [c.c.a.ApiServlet]
>>>> (catalina-exec-13:ctx-4e82fdce) ===START===  192.168.252.77 -- GET
>>>> command=reconnectHost&id=3490cfa0-b2a7-4a12-aa5e-7e351ce9df00&respon
>>>> se=json&sessionkey=Tnc9l6aaSvc8J5SNy3Z71FLXgEI%3D&_=1468833953948
>>>> 2016-07-18 11:26:07,450 INFO [o.a.c.f.j.i.AsyncJobMonitor]
>>>> (API-Job-Executor-23:ctx-fc340a8e job-148672) Add job-148672 into
>>>> job monitoring
>>>> 2016-07-18 11:26:07,453 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
>>>> (catalina-exec-13:ctx-4e82fdce ctx-9c696de2) submit async
>>>> job-148672,
>>>> details: AsyncJobVO {id:148672, userId: 51, accountId: 51,
>>>> instanceType: Host, instanceId: 15, cmd:
>>>> org.apache.cloudstack.api.command.admin.host.ReconnectHostCmd,
>>>> cmdInfo:
>>>> {"id":"3490cfa0-b2a7-4a12-aa5e-7e351ce9df00","response":"json","sess
>>>> ionkey":"Tnc9l6aaSvc8J5SNy3Z71FLXgEI\u003d","ctxDetails":"{\"com.clo
>>>> ud.host.Host\":\"3490cfa0-b2a7-4a12-aa5e-7e351ce9df00\"}","cmdEventT
>>>> ype":"HOST.RECONNECT","ctxUserId":"51","httpmethod":"GET","_":"14688
>>>> 33953948","uuid":"3490cfa0-b2a7-4a12-aa5e-7e351ce9df00","ctxAccountI
>>>> d":"51","ctxStartEventId":"18026840"},
>>>> cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0,
>>>> result: null, initMsid: 345049098122, completeMsid: null, lastUpdated:
>>>> null, lastPolled: null, created: null}
>>>> 2016-07-18 11:26:07,454 DEBUG [c.c.a.ApiServlet]
>>>> (catalina-exec-13:ctx-4e82fdce ctx-9c696de2) ===END===
>>>> 192.168.252.77
>>>> -- GET
>>>> command=reconnectHost&id=3490cfa0-b2a7-4a12-aa5e-7e351ce9df00&respon
>>>> se=json&sessionkey=Tnc9l6aaSvc8J5SNy3Z71FLXgEI%3D&_=1468833953948
>>>> 2016-07-18 11:26:07,455 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
>>>> (API-Job-Executor-23:ctx-fc340a8e job-148672) Executing AsyncJobVO
>>>> {id:148672, userId: 51, accountId: 51, instanceType: Host, instanceId:
>>>> 15, cmd:
>>>> org.apache.cloudstack.api.command.admin.host.ReconnectHostCmd,
>>>> cmdInfo:
>>>> {"id":"3490cfa0-b2a7-4a12-aa5e-7e351ce9df00","response":"json","sess
>>>> ionkey":"Tnc9l6aaSvc8J5SNy3Z71FLXgEI\u003d","ctxDetails":"{\"com.clo
>>>> ud.host.Host\":\"3490cfa0-b2a7-4a12-aa5e-7e351ce9df00\"}","cmdEventT
>>>> ype":"HOST.RECONNECT","ctxUserId":"51","httpmethod":"GET","_":"14688
>>>> 33953948","uuid":"3490cfa0-b2a7-4a12-aa5e-7e351ce9df00","ctxAccountI
>>>> d":"51","ctxStartEventId":"18026840"},
>>>> cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0,
>>>> result: null, initMsid: 345049098122, completeMsid: null, lastUpdated:
>>>> null, lastPolled: null, created: null}
>>>> 2016-07-18 11:26:07,461 DEBUG [c.c.a.m.DirectAgentAttache]
>>>> (DirectAgent-495:ctx-77e68e88) Seq 213-6743858967010618892:
>>>> Executing request
>>>> 2016-07-18 11:26:07,467 INFO  [c.c.a.m.AgentManagerImpl]
>>>> (API-Job-Executor-23:ctx-fc340a8e job-148672 ctx-0061c491) Unable to
>>>> disconnect host because it is not connected to this server: 15
>>>> 2016-07-18 11:26:07,467 WARN [o.a.c.a.c.a.h.ReconnectHostCmd]
>>>> (API-Job-Executor-23:ctx-fc340a8e job-148672 ctx-0061c491) Exception:
>>>> org.apache.cloudstack.api.ServerApiException: Failed to reconnect host
>>>>       at
>>>> org.apache.cloudstack.api.command.admin.host.ReconnectHostCmd.execute(ReconnectHostCmd.java:109)
>>>>       at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:141)
>>>>       at
>>>> com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:108)
>>>>       at
>>>> org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:537)
>>>>       at
>>>> org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>>>>       at
>>>> org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>>>>       at
>>>> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>>>>       at
>>>> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>>>>       at
>>>> org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>>>>       at
>>>> org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:494)
>>>>       at
>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>       at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>       at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>       at java.lang.Thread.run(Thread.java:745)
>>>> Connecting via SSH from the management server is fine, for ex.:
>>>>     [root@ewcstack-man03-prod ~]# ssh -i
>>>> /var/cloudstack/management/.ssh/id_rsa root@ewcstack-vh011-prod
>>>> "/opt/cloud/bin/router_proxy.sh netusage.sh 169.254.2.103 -g"
>>>>     root@ewcstack-vh011-prod's password:
>>>>     2592:0:0:0:[root@ewcstack-man03-prod ~]#
>>>> Any Idea how to solve this issue and how to track the reason of the
>>>> failure to reconnect?
>>>> Many thanks in advance for your help.
>>>> Best Regards
>>>> Francois
>>> --
>>> EveryWare AG
>>> François Scheurer
>>> Senior Systems Engineer
>>> Zurlindenstrasse 52a
>>> CH-8003 Zürich
>>> tel: +41 44 466 60 00
>>> fax: +41 44 466 60 10
>>> mail: francois.scheurer@everyware.ch
>>> web: http://www.everyware.ch
>

