Date: Fri, 8 Jul 2016 09:53:17 +0200 (CEST)
From: Wido den Hollander
To: dev@cloudstack.apache.org, Aaron Hurt, users@cloudstack.apache.org
Subject: Re: Ceph RBD related host agent segfault

> On 7 July 2016 at 6:35, Aaron Hurt wrote:
> 
> > On Jul 6, 2016, at 5:14 PM, Wido den Hollander wrote:
> > 
> >> On 6 July 2016 at 16:18, Aaron Hurt wrote:
> >> 
> >>> On Jul 2, 2016, at 11:37 AM, Wido den Hollander wrote:
> >>> 
> >>>> On 30 June 2016 at 18:29, Aaron Hurt wrote:
> >>>> 
> >>>> In preparation to roll a new platform built on 4.8 with a Ceph storage backend, we've
> >>>> been encountering segfaults that appear to be related to snapshot operations via
> >>>> rados-java (librbd) on the host agent. We've been able to isolate this to two possible
> >>>> places in the code:
> >>>> 
> >>>> lines ~866-875 in plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java
> >>>> 
> >>>>     for (RbdSnapInfo snap : snaps) {
> >>>>         if (image.snapIsProtected(snap.name)) {
> >>>>             s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
> >>>>             image.snapUnprotect(snap.name);
> >>>>         } else {
> >>>>             s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
> >>>>         }
> >>>>         s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
> >>>>         image.snapRemove(snap.name);
> >>>>     }
> >>>> 
> >>>> Should we be checking if the unprotect actually failed/succeeded before attempting to
> >>>> remove the snapshot?
> >>>> 
> >>>> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230) duplicates some of
> >>>> this functionality, and there doesn't seem to be any protection preventing
> >>>> deletePhysicalDisk and the cleanup routine from being run simultaneously.
> >>>> 
> >>>> To Reproduce (with ceph/rbd primary storage)
> >>>> 
> >>>> 1. Set the global concurrent.snapshots.threshold.perhost to the default NULL value
> >>>> 2. Set the globals snapshot.poll.interval and storage.cleanup.interval to a low interval, e.g. 10 seconds
> >>>> 3. Restart the management server
> >>>> 4. Deploy several VMs from templates
> >>>> 5. Destroy + expunge the VMs after they are running
> >>>> 6. Observe segfaults in the management server
> >>>> 
> >>>> Workaround
> >>>> 
> >>>> We've been able to eliminate the segfaults of the host agent in our testing by simply
> >>>> setting concurrent.snapshots.threshold.perhost to 1, even with the decreased poll
> >>>> intervals.
> >>>> 
> >>>> Segfault Logs
> >>>> 
> >>>> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d
> >>>> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56
> >>>> 
> >>>> We would really appreciate any feedback and/or confirmation from the community around
> >>>> the above issues. I'd also be happy to provide any additional information needed to
> >>>> get this addressed.
> >>> 
> >>> What seems to be happening is that it failed to unprotect the snapshot of the volume.
> >>> This can have various causes, for example a child image of the snapshot still
> >>> existing. I don't think that's the case here, however.
> >>> 
> >>> It could still be that it tries to remove the master/golden image of the template
> >>> while children are still attached to that snapshot.
> >>> 
> >>> I'm not sure if this is due to rados-java or a bug in librados. The Java code should
> >>> just throw an exception and not completely crash the JVM. This happens lower in the
> >>> stack, not in Java.
> >>> 
> >>> The assert shows this also happens when Java is talking to libvirt. My guess is a
> >>> librados bug, but I'm not completely sure.
> >>> 
> >>> Wido
> >> 
> >> We're seeing this happen around other issues, and it does seem to be related to
> >> rados-java and the JNA wrappers around librbd. This is an exception that just occurred
> >> this morning while performing a load balancer update.
> >> 
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:) (logid:7b48049b) Execution is successful.
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh checkrouter.sh 169.254.3.93
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:) (logid:7b48049b) Seq 1-5871286539207573659: { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Request:Seq 1-5871286539207573662: { Cmd , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 100001, [{"com.cloud.agent.api.routing.LoadBalancerConfigCommand":{"loadBalancers":[{"uuid":"733276c0-df18-4110-a4ed-5efc1f523eb2","srcIp":"10.107.0.12","srcPort":80,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[]},{"uuid":"1dcba169-922c-45d2-9f85-33d63ef6f0e7","srcIp":"10.107.0.11","srcPort":300,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.191","destPort":400,"revoked":false,"alreadyAdded":false}]},{"uuid":"580dd6d7-b12a-4e14-93d8-a6f87dd75763","srcIp":"10.107.0.13","srcPort":5000,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.35","destPort":80,"revoked":false,"alreadyAdded":false},{"destIp":"10.90.0.36","destPort":80,"revoked":false,"alreadyAdded":false}]},{"uuid":"6b8f4872-1d05-4942-b715-3b0bf92e9d20","srcIp":"10.107.0.19","srcPort":111,"protocol":"tcp","algorithm":"roundrobin","revoked":true,"alreadyAdded":false,"inline":false,"destinations":[]}],"lbStatsVisibility":"global","lbStatsPublicIP":"10.107.0.6","lbStatsPrivateIP":"169.254.0.11","lbStatsGuestIP":"10.90.0.14","lbStatsPort":"8081","lbStatsSrcCidrs":"0/0","lbStatsAuth":"admin1:AdMiN123","lbStatsUri":"/admin?stats","maxconn":"4096","keepAliveEnabled":false,"nic":{"deviceId":3,"networkRateMbps":200,"defaultNic":false,"pxeDisable":true,"nicUuid":"503eca28-76fb-4e0a-aaf1-66bb63fae4b5","uuid":"e0b77f27-83b3-4ce4-b81d-05b0c559b395","ip":"10.90.0.14","netmask":"255.255.255.0","gateway":"10.90.0.1","mac":"02:00:42:58:00:04","broadcastType":"Vxlan","type":"Guest","broadcastUri":"vxlan://6054","isolationUri":"vxlan://6054","isSecurityGroupEnabled":false,"name":"bond0.109"},"vpcId":10,"accessDetails":{"router.guest.ip":"10.90.0.14","zone.network.type":"Advanced","router.ip":"169.254
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: .0.11","router.name":"r-167-QA"},"wait":0}}] }
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Processing command: com.cloud.agent.api.routing.LoadBalancerConfigCommand
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Transforming com.cloud.agent.api.routing.LoadBalancerConfigCommand to ConfigItems
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: global
[cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global se= ction: global > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global se= ction: log 127.0.0.1:3914 local0 warning > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global se= ction: maxconn 4096 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global se= ction: maxpipes 1024 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global se= ction: chroot /var/lib/haproxy > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global se= ction: user haproxy > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global se= ction: group haproxy > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global se= ction: daemon > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default s= ection: defaults > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default s= ection: log global > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default s= ection: mode tcp > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default s= ection: option dontlognull > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default s= ection: retries 3 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default s= ection: option redispatch > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default s= ection: option forwardfor > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default s= ection: option forceclose > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default s= ection: timeout connect 5000 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default s= ection: timeout client 50000 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default s= ection: timeout server 50000 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: INFO [cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxy m= ode http enabled > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.networ= k.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxyst= ats rule: > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: listen stats_on_pub= lic 10.107.0.6:8081 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: mode http > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: option httpclose > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats enable > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats uri /admi= n?stats > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats realm Hapro= xy\ Statistics > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats auth admin= 1:AdMiN123 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource= .LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Execu= tion is successful. > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.= Agent] (agentRequest-Handler-2:) (logid:4a2bd0ba) Seq 1-5871286539207573661= : { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.a= gent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Stat= us: BACKUP\n","wait":0}}] } > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.vir= tualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec846= 69) Processing FileConfigItem, copying 1315 characters to load_balancer.jso= n took 237ms > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource= .LibvirtComputingResource] (agentRequest-Handler-1:) (logid:aec84669) Execu= ting: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh upd= ate_config.py 169.254.0.11 load_balancer.json > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.= h: In function 'bool ceph::log::SubsystemMap::should_gather(unsigned int, i= nt)' thread 7f530dff3700 time 2016-07-06 08:59:09.143659 > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.= h: 62: FAILED assert(sub < m_subsys.size()) > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ceph version 10.2.2= (45107e21c568dd033c2f0a3107dec8f0b0e58374) > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 1: (()+0x289da5) [0= x7f534a94bda5] > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 2: (()+0x50028) [0x= 7f534a712028] > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 3: (()+0x81ff3) [0x= 7f534a743ff3] > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 4: (()+0x4d8e8e) [0= x7f534ab9ae8e] > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 5: (()+0x4e617d) [0= x7f534aba817d] > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 6: (()+0x7dc5) [0x7= f5455e78dc5] > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 7: (clone()+0x6d) [= 0x7f5455793ced] > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: NOTE: a copy of the= executable, or `objdump -rdS ` is needed to interpret this. > >> Jul 06 08:59:14 njcloudhost.dev.ena.net sh[13457]: /bin/sh: line 1: 13= 461 Aborted (core dumped) /usr/lib/jvm/jre/bin/java -Xms256= m -Xmx2048m -cp "$CLASSPATH" $JAVA_CLASS > >> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.s= ervice: main process exited, code=3Dexited, status=3D134/n/a > >> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: Unit cloudstack-ag= ent.service entered failed state. > >> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.s= ervice failed. 
> >> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service holdoff time over, scheduling restart.
> >> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Started CloudStack Agent.
> >> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Starting CloudStack Agent...
> >> 
> >> Is there something obvious I'm missing here?
> > 
> > I am confused. In this path you are not even touching Ceph, but it still crashes in
> > librados.
> > 
> > Searching around, I found this issue in the Ceph tracker:
> > http://tracker.ceph.com/issues/14314
> > 
> > Isn't there a package version mismatch in your Ceph cluster?
> > 
> > Wido
> 
> I agree it's very confusing, and I'm running out of ideas as to what the cause may be.
> 
> Here are the package versions on all our related boxes in the lab:
> 
> http://pastie.org/private/5t5p61ryqbaqm6mw07bw9g
> 
> I've also collected the most recent instances of our two segfaults/aborts below.
> 
> journalctl -u cloudstack-agent.service --no-pager | grep -B30 -A2 Aborted
> 
> http://sprunge.us/LcYA
> 
> journalctl -u cloudstack-agent.service --no-pager | grep -B5 -A20 com.ceph.rbd.RbdException
> 
> http://sprunge.us/SiCf

Looking at this, I come to the conclusion that this is a librados bug, not rados-java or
CloudStack.

The crashes all have exactly the same backtrace. I have a few clusters running with Hammer
0.94.5 and they all clean up their snapshots just fine, no crashes.

> I also went back to look at the two places in the code where snapshot cleanup is taking
> place in our tree:
> 
> The place where the 'failed to unprotect' exceptions seem to be triggered:
> https://github.com/myENA/cloudstack/blob/release/ENA-4.8/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L840
> 
> The cleanup code for RBD snapshots:
> https://github.com/myENA/cloudstack/blob/679840ae674cc1c655c256e8047187fa3b157ce7/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java#L1289
> 
> I've rolled a small patch into our testing tree that moves the context and image cleanup
> into a 'finally' block, because I thought there might be a problem with those not being
> closed/freed when the unprotect threw an exception. So the code in
> LibvirtStorageAdaptor.java is slightly different from the mainline 4.8 branch. Here is
> the PR/diff in our fork that shows the changes:
> 
> https://github.com/myENA/cloudstack/pull/11/files
> 
> Does this make sense? Is this even possibly related to the issue I'm seeing?

The patch looks sane: you can catch more exceptions, but still, such a hard crash of
Java/librados can't easily be triggered from Java. I truly think this is a librados bug.
I've put rough, untested sketches of both the unprotect check you asked about earlier and
the try/finally cleanup at the bottom of this mail.

Wido

> Once again, all the advice and attention is definitely appreciated.
> 
> -- Aaron
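
PS: to make the earlier question about checking the unprotect result concrete, below is an
untested sketch of what the loop in LibvirtStorageAdaptor.java could do instead. It assumes
the same in-scope names the current code uses (snaps, image, pool, uuid, s_logger) and only
adds a re-check and per-snapshot exception handling; it is not code from any merged patch.

    for (RbdSnapInfo snap : snaps) {
        try {
            if (image.snapIsProtected(snap.name)) {
                s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
                image.snapUnprotect(snap.name);
                if (image.snapIsProtected(snap.name)) {
                    // The unprotect did not take effect, e.g. because a child image
                    // still exists; skip the remove instead of asking librbd to
                    // delete a snapshot that is still protected.
                    s_logger.warn("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is still protected, skipping removal");
                    continue;
                }
            }
            s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
            image.snapRemove(snap.name);
        } catch (RbdException e) {
            // Log and continue with the next snapshot instead of aborting the loop.
            s_logger.error("Cleanup of snapshot " + snap.name + " failed: " + e.getMessage());
        }
    }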
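
As for the 'finally' cleanup Aaron describes, the shape I would expect is roughly the
following. This is a sketch against the rados-java API (Rados, IoCTX, Rbd, RbdImage), not
the actual diff from the myENA pull request, and the method name is made up for
illustration.

    import com.ceph.rados.IoCTX;
    import com.ceph.rados.Rados;
    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;

    // Remove all snapshots of an RBD image, releasing the image handle and the
    // rados context on every code path, even when unprotect/remove throws.
    private void cleanupRbdSnapshots(Rados rados, String poolName, String imageName) throws Exception {
        IoCTX io = rados.ioCtxCreate(poolName);
        try {
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open(imageName);
            try {
                for (RbdSnapInfo snap : image.snapList()) {
                    if (image.snapIsProtected(snap.name)) {
                        image.snapUnprotect(snap.name);
                    }
                    image.snapRemove(snap.name);
                }
            } finally {
                rbd.close(image);   // runs even if a snapshot call threw
            }
        } finally {
            rados.ioCtxDestroy(io); // the context is never leaked
        }
    }

Whether leaked handles are what later trips the assert in librados I can't say, but
releasing them deterministically can't hurt.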