cloudstack-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Koushik Das (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLOUDSTACK-2428) HA - When the master host is disconnected , the host status contines to remain in "Up" state because of com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of slave
Date Wed, 18 Dec 2013 13:04:07 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851675#comment-13851675
] 

Koushik Das commented on CLOUDSTACK-2428:
-----------------------------------------

I looked at the logs and see that XenServerInvestigator is always returning the host status
for Rack3Host20.lab.vmops.com as 'Up' till 22:06. Only at 22:07 it returned heartbeat failure
for the first time. The CheckOnHostCommand which is actually called by XS investigator is
returning that the the heartbeat is running. The config parameters xen.heartbeat.interval,
ping.interval and ping.timeout are all set to default values. I looked at the check_heartbeat.sh
script which gets called at the XS host to check if the heartbeat is alive or dead and didn't
see anything unusual.

Some of the possibilities I can think of for the behaviour seen:
- There may be a time lag between the XS hosts Rack3Host20.lab.vmops.com and Rack3Host23.lab.vmops.com.
Can you check if the hosts are time synchronised or not?
- Is it possible that the heartbeat file is getting updated through some other host from another
MS (highly unlikely but just wanted to rule it out)?


2013-12-16 22:05:59,890 INFO  [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-d57c261c) Investigating
why host 1 has disconnected with event PingTimeout
2013-12-16 22:05:59,890 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-d57c261c) checking
if agent (1) is alive
2013-12-16 22:05:59,892 DEBUG [c.c.a.t.Request] (AgentTaskPool-1:ctx-d57c261c) Seq 1-1825048696:
Sending  { Cmd , MgmtId: 112516401760401, via: 1(Rack3Host20.lab.vmops.com), Ver: v1, Flags:
100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
2013-12-16 22:05:59,892 DEBUG [c.c.a.t.Request] (AgentTaskPool-1:ctx-d57c261c) Seq 1-1825048696:
Executing:  { Cmd , MgmtId: 112516401760401, via: 1(Rack3Host20.lab.vmops.com), Ver: v1, Flags:
100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
2013-12-16 22:06:02,896 DEBUG [c.c.a.t.Request] (AgentTaskPool-1:ctx-d57c261c) Seq 1-1825048696:
Received:  { Ans: , MgmtId: 112516401760401, via: 1, Ver: v1, Flags: 10, { CheckHealthAnswer
} }
2013-12-16 22:06:02,896 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-d57c261c) Details
from executing class com.cloud.agent.api.CheckHealthCommand: resource is not alive
2013-12-16 22:06:02,897 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-1:ctx-d57c261c)
SimpleInvestigator unable to determine the state of the host.  Moving on.
2013-12-16 22:06:02,900 DEBUG [c.c.a.t.Request] (AgentTaskPool-1:ctx-d57c261c) Seq 2-120391533:
Sending  { Cmd , MgmtId: 112516401760401, via: 2(Rack3Host23.lab.vmops.com), Ver: v1, Flags:
100011, [{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"eb0add58-3df6-4870-8937-ac21abe5471b","privateNetwork":{"ip":"10.223.59.66","netmask":"255.255.255.192","mac":"bc:30:5b:d4:16:39","isSecurityGroupEnabled":false},"publicNetwork":{"ip":"10.223.59.66","netmask":"255.255.255.192","mac":"bc:30:5b:d4:16:39","isSecurityGroupEnabled":false},"storageNetwork1":{"ip":"10.223.59.66","netmask":"255.255.255.192","mac":"bc:30:5b:d4:16:39","isSecurityGroupEnabled":false}},"wait":20}}]
}
2013-12-16 22:06:02,900 DEBUG [c.c.a.t.Request] (AgentTaskPool-1:ctx-d57c261c) Seq 2-120391533:
Executing:  { Cmd , MgmtId: 112516401760401, via: 2(Rack3Host23.lab.vmops.com), Ver: v1, Flags:
100011, [{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"eb0add58-3df6-4870-8937-ac21abe5471b","privateNetwork":{"ip":"10.223.59.66","netmask":"255.255.255.192","mac":"bc:30:5b:d4:16:39","isSecurityGroupEnabled":false},"publicNetwork":{"ip":"10.223.59.66","netmask":"255.255.255.192","mac":"bc:30:5b:d4:16:39","isSecurityGroupEnabled":false},"storageNetwork1":{"ip":"10.223.59.66","netmask":"255.255.255.192","mac":"bc:30:5b:d4:16:39","isSecurityGroupEnabled":false}},"wait":20}}]
}
2013-12-16 22:06:03,676 DEBUG [c.c.a.t.Request] (AgentTaskPool-1:ctx-d57c261c) Seq 2-120391533:
Received:  { Ans: , MgmtId: 112516401760401, via: 2, Ver: v1, Flags: 10, { CheckOnHostAnswer
} }
2013-12-16 22:06:03,676 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-d57c261c) Details
from executing class com.cloud.agent.api.CheckOnHostCommand: Heartbeat is still going
2013-12-16 22:06:03,676 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-1:ctx-d57c261c)
XenServerInvestigator was able to determine host 1 is in Up
2013-12-16 22:06:03,676 INFO  [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-d57c261c) The
state determined is Up
2013-12-16 22:06:03,676 INFO  [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-d57c261c) Agent
is determined to be up and running
2013-12-16 22:06:03,676 DEBUG [c.c.h.Status] (AgentTaskPool-1:ctx-d57c261c) Transition:[Resource
state = Enabled, Agent event = Ping, Host id = 1, name = Rack3Host20.lab.vmops.com]
2013-12-16 22:06:03,680 DEBUG [c.c.h.Status] (AgentTaskPool-1:ctx-d57c261c) Agent status update:
[id = 1; name = Rack3Host20.lab.vmops.com; old status = Up; event = Ping; new status = Up;
old update count = 500; new update count = 501]


| Advanced | DEFAULT  | management-server | xen.heartbeat.interval | 60    | heartbeat to
use when implementing XenServer Self Fencing | NULL          | NULL    | NULL  |         
0 |
| Advanced | DEFAULT  | AgentManager | ping.interval | 60    | Interval to send application
level pings to make sure the connection is still working | 60            | 2013-12-16 18:22:22
| Global |          0 |
| Advanced | DEFAULT  | AgentManager | ping.timeout  | 2.5   | Multiplier to ping.interval
before announcing an agent has timed out                  | 2.5           | 2013-12-16 18:22:22
| Global |          1 |


> HA - When the master host is disconnected , the host status contines to remain in "Up"
state because of com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of
slave 
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-2428
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-2428
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Management Server
>    Affects Versions: 4.2.0, 4.3.0
>         Environment: Build from pvaln
>            Reporter: Sangeetha Hariharan
>            Assignee: Koushik Das
>            Priority: Critical
>             Fix For: 4.2.0, 4.3.0
>
>         Attachments: hostdown.rar, hostdown.rar, logs.rar, logs_7_29, management-server.rar,
masterhostdown.rar
>
>
> 1. Advance zone  with 1 cluster with 2 hosts. Create  Shared network with private vlan.
> 2. Deploy few HA enabled Vms in this network. 
> 3. pull network cable for one of the host.
> When cloudstack detects that the host is disconnected , it is not able to out the host
in disconnected state and start HA for Vms that are HA enabeld,
> I see the following exception in the management server logs:
> 2013-05-09 17:15:55,576 DEBUG [agent.manager.DirectAgentAttache] (DirectAgent-267:null)
Seq 1-1435828229: Executing request
> 2013-05-09 17:15:55,602 DEBUG [xen.resource.XenServerConnectionPool] (DirectAgent-267:null)
Catch Exception: com.xensource.xenapi.Types$HostOffline Host is offline 10.223.81.62 due to
You attempted an operation which involves a host which could not be contacted.
> 2013-05-09 17:15:55,603 DEBUG [xen.resource.XenServerConnectionPool] (DirectAgent-267:null)
Trying to reset master of slave 10.223.81.62 to 10.223.81.61
> 2013-05-09 17:16:02,319 WARN  [xen.resource.CitrixResourceBase] (DirectAgent-265:null)
 can not ping xenserver 520d4994-8b1f-4dda-b51d-2ee63750abf6
> 2013-05-09 17:16:02,319 WARN  [agent.manager.DirectAgentAttache] (DirectAgent-265:null)
Unable to get current status on 1
> 2013-05-09 17:16:02,321 INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-11:null)
Investigating why host 1 has disconnected with event AgentDisconnected
> 2013-05-09 17:16:02,321 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-11:null)
checking if agent (1) is alive
> 2013-05-09 17:16:02,323 DEBUG [agent.transport.Request] (AgentTaskPool-11:null) Seq 1-1435828294:
Sending  { Cmd , MgmtId: 7647994577963, via: 1, Ver: v1, Flags: 100011, [{"CheckHealthCommand":{"wait":50}}]
}
> 2013-05-09 17:16:02,323 DEBUG [agent.transport.Request] (AgentTaskPool-11:null) Seq 1-1435828294:
Executing:  { Cmd , MgmtId: 7647994577963, via: 1, Ver: v1, Flags: 100011, [{"CheckHealthCommand":{"wait":50}}]
}
> 2013-05-09 17:16:02,323 DEBUG [agent.manager.DirectAgentAttache] (DirectAgent-271:null)
Seq 1-1435828294: Executing request
> 2013-05-09 17:16:04,035 DEBUG [agent.manager.AgentAttache] (AgentTaskPool-10:null) Seq
6-474349576: Waiting some more time because this is the current command
> 2013-05-09 17:16:04,040 DEBUG [xen.resource.XenServerConnectionPool] (DirectAgent-268:null)
localLogout has problem Failed to read server's response: connect timed out
> 2013-05-09 17:16:04,040 WARN  [agent.manager.DirectAgentAttache] (DirectAgent-268:null)
Seq 1-1435828292: Exception Caught while executing command
> com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of slave 10.223.81.62
to 10.223.81.61 due to org.apache.xmlrpc.XmlRpcException: Failed to read server's response:
connect timed out
>         at com.cloud.hypervisor.xen.resource.XenServerConnectionPool.PoolEmergencyResetMaster(XenServerConnectionPool.java:443)
>         at com.cloud.hypervisor.xen.resource.XenServerConnectionPool.connect(XenServerConnectionPool.java:661)
>         at com.cloud.hypervisor.xen.resource.CitrixResourceBase.getConnection(CitrixResourceBase.java:5639)
>         at com.cloud.hypervisor.xen.resource.CitrixResourceBase.execute(CitrixResourceBase.java:1682)
>         at com.cloud.hypervisor.xen.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:524)
>         at com.cloud.hypervisor.xen.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:73)
>         at com.cloud.hypervisor.xen.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:102)
>         at com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:679)
> 2013-05-09 17:16:04,041 DEBUG [agent.manager.DirectAgentAttache] (DirectAgent-268:null)
Seq 1-1435828292: Response Received:
> 2013-05-09 17:16:04,041 DEBUG [agent.transport.Request] (DirectAgent-268:null) Seq 1-1435828292:
Processing:  { Ans: , MgmtId: 7647994577963, via: 1, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
Unable to reset master of slave 10.223.81.62 to 10.223.81.61 due to org.apache.xmlrpc.XmlRpcException:
Failed to read server's response: connect timed out","wait":0}}] }
> 2013-05-09 17:16:04,041 DEBUG [agent.transport.Request] (AgentTaskPool-5:null) Seq 1-1435828292:
Received:  { Ans: , MgmtId: 7647994577963, via: 1, Ver: v1, Flags: 10, { Answer } }
> 2013-05-09 17:16:04,041 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-5:null)
host (10.223.81.50) cannot be pinged, returning null ('I don't know')
> 2013-05-09 17:16:04,041 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-5:null)
sending ping from (5) to agent's host ip address (10.223.81.50)
> 2013-05-09 17:16:04,043 DEBUG [agent.transport.Request] (AgentTaskPool-5:null) Seq 5-2082341067:
Sending  { Cmd , MgmtId: 7647994577963, via: 5, Ver: v1, Flags: 100011, [{"PingTestCommand":{"_computingHostIp":"10.223.81.50","wait":20}}]
}
> 2013-05-09 17:16:04,043 DEBUG [agent.transport.Request] (AgentTaskPool-5:null) Seq 5-2082341067:
Executing:  { Cmd , MgmtId: 7647994577963, via: 5, Ver: v1, Flags: 100011, [{"PingTestCommand":{"_computingHostIp":"10.223.81.50","wait":20}}]
}
> 2013-05-09 17:16:04,043 DEBUG [agent.manager.DirectAgentAttache] (DirectAgent-272:null)
Seq 5-2082341067: Executing request
> 2013-05-09 17:16:04,053 DEBUG [xen.resource.XenServerConnectionPool] (DirectAgent-91:null)
localLogout has problem Failed to read server's response: connect timed out
> 2013-05-09 17:16:04,053 WARN  [agent.manager.DirectAgentAttache] (DirectAgent-91:null)
Seq 1-1435828293: Exception Caught while executing command
> com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of slave 10.223.81.62
to 10.223.81.61 due to org.apache.xmlrpc.XmlRpcException: Failed to read server's response:
connect timed out
>         at com.cloud.hypervisor.xen.resource.XenServerConnectionPool.PoolEmergencyResetMaster(XenServerConnectionPool.java:443)
>         at com.cloud.hypervisor.xen.resource.XenServerConnectionPool.connect(XenServerConnectionPool.java:661)
>         at com.cloud.hypervisor.xen.resource.CitrixResourceBase.getConnection(CitrixResourceBase.java:5639)
>         at com.cloud.hypervisor.xen.resource.CitrixResourceBase.execute(CitrixResourceBase.java:1682)
>         at com.cloud.hypervisor.xen.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:524)
>         at com.cloud.hypervisor.xen.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:73)
>         at com.cloud.hypervisor.xen.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:102)
>         at com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:679)
> 2013-05-09 17:16:04,054 DEBUG [agent.manager.DirectAgentAttache] (DirectAgent-91:null)
Seq 1-1435828293: Response Received:
> 2013-05-09 17:16:04,054 DEBUG [agent.transport.Request] (DirectAgent-91:null) Seq 1-1435828293:
Processing:  { Ans: , MgmtId: 7647994577963, via: 1, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
Unable to reset master of slave 10.223.81.62 to 10.223.81.61 due to org.apache.xmlrpc.XmlRpcException:
Failed to read server's response: connect timed out","wait":0}}] }
> 2013-05-09 17:16:04,055 DEBUG [agent.transport.Request] (AgentTaskPool-7:null) Seq 1-1435828293:
Received:  { Ans: , MgmtId: 7647994577963, via: 1, Ver: v1, Flags: 10, { Answer } }
> 2013-05-09 17:16:04,055 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-7:null)
host (10.223.81.52) cannot be pinged, returning null ('I don't know')
> 2013-05-09 17:16:04,055 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-7:null)
sending ping from (5) to agent's host ip address (10.223.81.52)
> 2013-05-09 17:16:04,057 DEBUG [agent.transport.Request] (AgentTaskPool-7:null) Seq 5-2082341068:
Sending  { Cmd , MgmtId: 7647994577963, via: 5, Ver: v1, Flags: 100011, [{"PingTestCommand":{"_computingHostIp":"10.223.81.52","wait":20}}]
}
> 2013-05-09 17:16:04,057 DEBUG [agent.manager.AgentAttache] (AgentTaskPool-14:null) Seq
3-1752367195: Waiting some more time because this is the current command
> 2013-05-09 17:16:04,057 DEBUG [agent.transport.Request] (AgentTaskPool-7:null) Seq 5-2082341068:
Executing:  { Cmd , MgmtId: 7647994577963, via: 5, Ver: v1, Flags: 100011, [{"PingTestCommand":{"_computingHostIp":"10.223.81.52","wait":20}}]
}
> 2013-05-09 17:16:04,057 DEBUG [agent.manager.DirectAgentAttache] (DirectAgent-91:null)
Seq 5-2082341068: Executing request
> 2013-05-09 17:16:05,175 DEBUG [storage.secondary.SecondaryStorageManagerImpl] (secstorage-1:null)
Zone 1 is ready to launch secondary storage VM
> 2013-05-09 17:16:05,614 DEBUG [xen.resource.XenServerConnectionPool] (DirectAgent-267:null)
localLogout has problem Failed to read server's response: connect timed out
> 2013-05-09 17:16:05,614 WARN  [agent.manager.DirectAgentAttache] (DirectAgent-267:null)
Seq 1-1435828229: Exception Caught while executing command
> com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of slave 10.223.81.62
to 10.223.81.61 due to org.apache.xmlrpc.XmlRpcException: Failed to read server's response:
connect timed out
>         at com.cloud.hypervisor.xen.resource.XenServerConnectionPool.PoolEmergencyResetMaster(XenServerConnectionPool.java:443)
>         at com.cloud.hypervisor.xen.resource.XenServerConnectionPool.connect(XenServerConnectionPool.java:661)
>         at com.cloud.hypervisor.xen.resource.CitrixResourceBase.getConnection(CitrixResourceBase.java:5639)
>         at com.cloud.hypervisor.xen.resource.CitrixResourceBase.execute(CitrixResourceBase.java:7725)
>         at com.cloud.hypervisor.xen.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:570)
>         at com.cloud.hypervisor.xen.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:73)
>         at com.cloud.hypervisor.xen.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:102)
>         at com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:679)
> 2013-05-09 17:16:05,615 DEBUG [agent.manager.DirectAgentAttache] (DirectAgent-267:null)
Seq 1-1435828229: Response Received:
> 2013-05-09 17:16:05,615 DEBUG [agent.transport.Request] (DirectAgent-267:null) Seq 1-1435828229:
Processing:  { Ans: , MgmtId: 7647994577963, via: 1, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
Unable to reset master of slave 10.223.81.62 to 10.223.81.61 due to org.apache.xmlrpc.XmlRpcException:
Failed to read server's response: connect timed out","wait":0}}] }
> 2013-05-09 17:16:05,704 DEB



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

Mime
View raw message