cloudstack-dev mailing list archives

From Aaron Hurt <ah...@ena.com>
Subject Re: Ceph RBD related host agent segfault
Date Thu, 07 Jul 2016 04:35:24 GMT

> On Jul 6, 2016, at 5:14 PM, Wido den Hollander <wido@widodh.nl> wrote:
> 
>> 
>> On July 6, 2016 at 16:18, Aaron Hurt <ahurt@ena.com> wrote:
>> 
>> 
>> 
>>> On Jul 2, 2016, at 11:37 AM, Wido den Hollander <wido@widodh.nl> wrote:
>>> 
>>>> 
>>>> On June 30, 2016 at 18:29, Aaron Hurt <ahurt@ena.com> wrote:
>>>> 
>>>> 
>>>> In preparation to roll a new platform built on 4.8 with a Ceph storage backend,
we’ve been encountering segfaults that appear to be related to snapshot operations via
rados-java (librbd) on the host agent.  We’ve been able to isolate this to two possible
places in the code:
>>>> 
>>>> lines ~866-875 in plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java
>>>> 
>>>>               for (RbdSnapInfo snap : snaps) {
>>>>                   if (image.snapIsProtected(snap.name)) {
>>>>                       s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>                       image.snapUnprotect(snap.name);
>>>>                   } else {
>>>>                       s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
>>>>                   }
>>>>                   s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>                   image.snapRemove(snap.name);
>>>>               }
>>>> 
>>>> Should we be checking if the unprotect actually failed/succeeded before attempting
to remove the snapshot?
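A minimal sketch of that unprotect-then-verify idea. The `StubImage` class and `safeToRemove` helper are hypothetical stand-ins, not the real rados-java API; the point is only the pattern of re-checking protection state before removing:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stub that mimics the protect/unprotect behavior of an RBD image.
class StubImage {
    private final List<String> protectedSnaps;
    StubImage(List<String> initiallyProtected) {
        this.protectedSnaps = new ArrayList<>(initiallyProtected);
    }
    boolean snapIsProtected(String name) { return protectedSnaps.contains(name); }
    void snapUnprotect(String name) { protectedSnaps.remove(name); }
}

public class SnapshotCleanup {
    // Only report a snapshot safe to remove once it is verified unprotected,
    // instead of assuming snapUnprotect() succeeded.
    static boolean safeToRemove(StubImage image, String snap) {
        if (image.snapIsProtected(snap)) {
            image.snapUnprotect(snap);
            if (image.snapIsProtected(snap)) {
                return false; // unprotect failed, e.g. a child image still exists
            }
        }
        return true;
    }

    public static void main(String[] args) {
        StubImage image = new StubImage(List.of("snap-a"));
        System.out.println(safeToRemove(image, "snap-a")); // prints true
        System.out.println(safeToRemove(image, "snap-b")); // prints true (never protected)
    }
}
```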
>>>> 
>>>> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230) duplicates
some of this functionality, and there doesn’t seem to be any protection preventing
deletePhysicalDisk and the cleanup routine from running simultaneously.
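One way those two paths could be serialized is a per-volume lock. This is a hypothetical sketch (`VolumeLocks`/`withVolumeLock` are illustrative names, not CloudStack's actual code), showing the general shape of guarding concurrent operations on the same RBD image:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class VolumeLocks {
    // One lock per volume UUID, created lazily.
    private static final ConcurrentHashMap<String, ReentrantLock> locks =
            new ConcurrentHashMap<>();

    // Run an operation (deletePhysicalDisk, cleanup, ...) while holding the
    // lock for that volume, so two threads never touch the same image at once.
    static void withVolumeLock(String uuid, Runnable op) {
        ReentrantLock lock = locks.computeIfAbsent(uuid, k -> new ReentrantLock());
        lock.lock();
        try {
            op.run();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        StringBuilder order = new StringBuilder();
        Thread delete  = new Thread(() -> withVolumeLock("vol-1", () -> order.append("delete;")));
        Thread cleanup = new Thread(() -> withVolumeLock("vol-1", () -> order.append("cleanup;")));
        delete.start();  delete.join();
        cleanup.start(); cleanup.join();
        System.out.println(order); // prints delete;cleanup;
    }
}
```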
>>>> 
>>>> 
>>>> To Reproduce (with ceph/rbd primary storage)
>>>> 
>>>> 1.  Set global concurrent.snapshots.threshold.perhost to the default NULL
value
>>>> 2.  Set global snapshot.poll.interval and storage.cleanup.interval to a low
interval … 10 seconds
>>>> 3.  Restart management server
>>>> 4.  Deploy several VMs from templates
>>>> 5.  Destroy+expunge the VMs after they are running
>>>> 6.  Observe segfaults in management server
>>>> 
>>>> 
>>>> Workaround
>>>> 
>>>> We’ve been able to eliminate the segfaults of the host agent in our testing
by simply setting concurrent.snapshots.threshold.perhost to 1 even with the decreased poll
intervals.
>>>> 
>>>> Segfault Logs
>>>> 
>>>> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d
>>>> 
>>>> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56
>>>> 
>>>> We would really appreciate any feedback and/or confirmation from the community
around the above issues.  I’d also be happy to provide any additional information needed
to get this addressed.
>>> 
>>> What seems to be happening is that it failed to unprotect the snapshot of the
volume. This could have various causes, for example a child image of the snapshot,
but I don't think that's the case here.
>>> 
>>> It could still be that it tries to remove the master/golden image from the template
while it still has children attached to that snapshot.
>>> 
>>> I'm not sure if this is due to rados-java or a bug in librados. The Java code should
just throw an exception and not completely crash the JVM. This happens lower in the
code and not in Java.
>>> 
>>> The assert shows this also happens when Java is talking to libvirt. My guess is a
librados bug, but I'm not completely sure.
>>> 
>>> Wido
>> 
>> 
>> We’re seeing this happen around other issues, and it does seem to be related to
rados-java and the JNA wrappings around librbd.  This is an exception that just occurred
this morning while performing a load balancer update.
>> 
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-3:) (logid:7b48049b) Execution is successful.
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-2:) (logid:4a2bd0ba) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh
checkrouter.sh 169.254.3.93
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:)
(logid:7b48049b) Seq 1-5871286539207573659:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1,
Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status:
BACKUP\n","wait":0}}] }
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:)
(logid:aec84669) Request:Seq 1-5871286539207573662:  { Cmd , MgmtId: 52239507206, via: 1,
Ver: v1, Flags: 100001, [{"com.cloud.agent.api.routing.LoadBalancerConfigCommand":{"loadBalancers":[{"uuid":"733276c0-df18-4110-a4ed-5efc1f523eb2","srcIp":"10.107.0.12","srcPort":80,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[]},{"uuid":"1dcba169-922c-45d2-9f85-33d63ef6f0e7","srcIp":"10.107.0.11","srcPort":300,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.191","destPort":400,"revoked":false,"alreadyAdded":false}]},{"uuid":"580dd6d7-b12a-4e14-93d8-a6f87dd75763","srcIp":"10.107.0.13","srcPort":5000,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.35","destPort":80,"revoked":false,"alreadyAdded":false},{"destIp":"10.90.0.36","destPort":80,"revoked":false,"alreadyAdded":false}]},{"uuid":"6b8f4872-1d05-4942-b715-3b0bf92e9d20","srcIp":"10.107.0.19","srcPort":111,"protocol":"tcp","algorithm":"roundrobin","revoked":true,"alreadyAdded":false,"inline":false,"destinations":[]}],"lbStatsVisibility":"global","lbStatsPublicIP":"10.107.0.6","lbStatsPrivateIP":"169.254.0.11","lbStatsGuestIP":"10.90.0.14","lbStatsPort":"8081","lbStatsSrcCidrs":"0/0","lbStatsAuth":"admin1:AdMiN123","lbStatsUri":"/admin?stats","maxconn":"4096","keepAliveEnabled":false,"nic":{"deviceId":3,"networkRateMbps":200,"defaultNic":false,"pxeDisable":true,"nicUuid":"503eca28-76fb-4e0a-aaf1-66bb63fae4b5","uuid":"e0b77f27-83b3-4ce4-b81d-05b0c559b395","ip":"10.90.0.14","netmask":"255.255.255.0","gateway":"10.90.0.1","mac":"02:00:42:58:00:04","broadcastType":"Vxlan","type":"Guest","broadcastUri":"vxlan://6054","isolationUri":"vxlan://6054","isSecurityGroupEnabled":false,"name":"bond0.109"},"vpcId":10,"accessDetails":{"router.guest.ip":"10.90.0.14","zone.network.type":"Advanced","router.ip":"169.254
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: .0.11","router.name":"r-167-QA"},"wait":0}}]
}
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:)
(logid:aec84669) Processing command: com.cloud.agent.api.routing.LoadBalancerConfigCommand
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource]
(agentRequest-Handler-1:) (logid:aec84669) Transforming com.cloud.agent.api.routing.LoadBalancerConfigCommand
to ConfigItems
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) global section: global
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) global section:         log 127.0.0.1:3914   local0
warning
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) global section:         maxconn 4096
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) global section:         maxpipes 1024
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) global section:         chroot /var/lib/haproxy
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) global section:         user haproxy
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) global section:         group haproxy
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) global section:         daemon
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) default section: defaults
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) default section:         log     global
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) default section:         mode    tcp
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) default section:         option  dontlognull
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) default section:         retries 3
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) default section:         option redispatch
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) default section:         option forwardfor
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) default section:         option forceclose
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) default section:         timeout connect    5000
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) default section:         timeout client     50000
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) default section:         timeout server     50000
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: INFO  [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) Haproxy mode http enabled
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator]
(agentRequest-Handler-1:) (logid:aec84669) Haproxystats rule:
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: listen stats_on_public 10.107.0.6:8081
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: mode http
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: option httpclose
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats enable
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats uri     /admin?stats
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats realm   Haproxy\ Statistics
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats auth    admin1:AdMiN123
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-2:) (logid:4a2bd0ba) Execution is successful.
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:)
(logid:4a2bd0ba) Seq 1-5871286539207573661:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1,
Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status:
BACKUP\n","wait":0}}] }
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource]
(agentRequest-Handler-1:) (logid:aec84669) Processing FileConfigItem, copying 1315 characters
to load_balancer.json took 237ms
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-1:) (logid:aec84669) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh
update_config.py 169.254.0.11 load_balancer.json
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: In function
'bool ceph::log::SubsystemMap::should_gather(unsigned int, int)' thread 7f530dff3700 time
2016-07-06 08:59:09.143659
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: 62: FAILED
assert(sub < m_subsys.size())
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 1: (()+0x289da5) [0x7f534a94bda5]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 2: (()+0x50028) [0x7f534a712028]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 3: (()+0x81ff3) [0x7f534a743ff3]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 4: (()+0x4d8e8e) [0x7f534ab9ae8e]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 5: (()+0x4e617d) [0x7f534aba817d]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 6: (()+0x7dc5) [0x7f5455e78dc5]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 7: (clone()+0x6d) [0x7f5455793ced]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: NOTE: a copy of the executable,
or `objdump -rdS <executable>` is needed to interpret this.
>> Jul 06 08:59:14 njcloudhost.dev.ena.net sh[13457]: /bin/sh: line 1: 13461 Aborted
                (core dumped) /usr/lib/jvm/jre/bin/java -Xms256m -Xmx2048m -cp "$CLASSPATH"
$JAVA_CLASS
>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service: main
process exited, code=exited, status=134/n/a
>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: Unit cloudstack-agent.service
entered failed state.
>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service failed.
>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service holdoff
time over, scheduling restart.
>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Started CloudStack Agent.
>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Starting CloudStack Agent…
>> 
>> Is there something obvious I’m missing here?
>> 
> 
> I am confused. In this path you are not even touching Ceph, but still it crashes on
librados.
> 
> Searching I found this issue in the Ceph tracker: http://tracker.ceph.com/issues/14314
> 
> Isn't there a package version mismatch in your Ceph cluster?
> 
> Wido
> 

I agree it’s very confusing and I’m running out of ideas as to what the cause may be.

Here are the package versions on all our related boxes in the lab.

http://pastie.org/private/5t5p61ryqbaqm6mw07bw9g

I’ve also collected the most recent instances of our two segfaults/aborts below.

journalctl -u cloudstack-agent.service --no-pager | grep -B30 -A2 Aborted

http://sprunge.us/LcYA

journalctl -u cloudstack-agent.service --no-pager | grep -B5 -A20 com.ceph.rbd.RbdException

http://sprunge.us/SiCf

I also went back to look at the two places in the code where snapshot cleanup is taking place
in our tree:

The place where the failed to unprotect exceptions seem to be triggered:
https://github.com/myENA/cloudstack/blob/release/ENA-4.8/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L840

The cleanup code for rbd snapshots:
https://github.com/myENA/cloudstack/blob/679840ae674cc1c655c256e8047187fa3b157ce7/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java#L1289

I’ve rolled a small patch into our testing tree that moves the context and image cleanup
into a ‘finally’ block, because I thought there might be a problem with those not being
closed/freed when the unprotect threw an exception.  So the code in LibvirtStorageAdaptor.java
is slightly different from the mainline 4.8 branch.  Here is the PR/diff in our fork that shows
the changes.

https://github.com/myENA/cloudstack/pull/11/files
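The shape of that try/finally change can be sketched with stubs. These stand-ins (`openIoCtx`, `closeIoCtx`, `unprotectSnap`) are illustrative, not the real rados-java calls; the point is that the native handle is released even when the snapshot operation throws:

```java
public class FinallyCleanup {
    static final StringBuilder trace = new StringBuilder();

    static void openIoCtx()  { trace.append("open;"); }   // stand-in for creating the rados IO context
    static void closeIoCtx() { trace.append("close;"); }  // stand-in for destroying it
    static void unprotectSnap() { throw new RuntimeException("unprotect failed"); }

    static void deletePhysicalDisk() {
        openIoCtx();
        try {
            unprotectSnap(); // may throw, like RbdException in the real code
        } finally {
            closeIoCtx();    // always release the native handle, even on failure
        }
    }

    public static void main(String[] args) {
        try {
            deletePhysicalDisk();
        } catch (RuntimeException e) {
            trace.append("caught;");
        }
        System.out.println(trace); // prints open;close;caught;
    }
}
```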

Does this make sense?  Is this even possibly related to the issue I’m seeing?

Once again, all the advice and attention is definitely appreciated.

— Aaron

