cloudstack-users mailing list archives

From: Wido den Hollander <w...@widodh.nl>
Subject: Re: Ceph RBD related host agent segfault
Date: Sat, 02 Jul 2016 16:37:56 GMT

> On 30 June 2016 at 18:29, Aaron Hurt <ahurt@ena.com> wrote:
> 
> 
> In preparation to roll a new platform built on 4.8 with a Ceph storage backend, we’ve
> been encountering segfaults that appear to be related to snapshot operations via rados-java
> (librbd) on the host agent.  We’ve been able to isolate this to two possible places in the
> code:
> 
> lines ~866-875 in plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java
> 
>                 for (RbdSnapInfo snap : snaps) {
>                     if (image.snapIsProtected(snap.name)) {
>                         s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>                         image.snapUnprotect(snap.name);
>                     } else {
>                         s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
>                     }
>                     s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>                     image.snapRemove(snap.name);
>                 }
> 
> Should we be checking whether the unprotect actually succeeded before attempting to
> remove the snapshot?
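> 
> For illustration, a rough sketch of the kind of check we have in mind, assuming the
> existing imports and logger of LibvirtStorageAdaptor and that the snap* calls signal
> failure by throwing com.ceph.rbd.RbdException:
> 
>                 for (RbdSnapInfo snap : snaps) {
>                     try {
>                         if (image.snapIsProtected(snap.name)) {
>                             s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>                             image.snapUnprotect(snap.name);
>                         } else {
>                             s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
>                         }
>                     } catch (RbdException e) {
>                         // Unprotect (or the protection check) failed, e.g. because the snapshot
>                         // still has child images; skip the removal instead of pressing on.
>                         s_logger.warn("Failed to unprotect snapshot " + snap.name + ", skipping removal: " + e.getMessage());
>                         continue;
>                     }
>                     s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>                     image.snapRemove(snap.name);
>                 }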
> 
> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230) duplicates some of
> this functionality, and there doesn’t seem to be any protection preventing deletePhysicalDisk
> and the cleanup routine from being run simultaneously.
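> 
> As an illustration of the kind of guard that seems to be missing, a rough sketch of
> serializing those two code paths per RBD image; the RbdImageLocks class below is purely
> hypothetical, not existing CloudStack code:
> 
>     import java.util.concurrent.ConcurrentHashMap;
>     import java.util.concurrent.locks.ReentrantLock;
> 
>     public class RbdImageLocks {
>         private static final ConcurrentHashMap<String, ReentrantLock> LOCKS =
>                 new ConcurrentHashMap<String, ReentrantLock>();
> 
>         // One lock per "pool/image" key, shared by deletePhysicalDisk and the cleanup routine.
>         public static ReentrantLock lockFor(String pool, String image) {
>             ReentrantLock fresh = new ReentrantLock();
>             ReentrantLock existing = LOCKS.putIfAbsent(pool + "/" + image, fresh);
>             return existing != null ? existing : fresh;
>         }
>     }
> 
>     // In both deletePhysicalDisk and the snapshot cleanup routine:
>     ReentrantLock lock = RbdImageLocks.lockFor(pool.getSourceDir(), uuid);
>     lock.lock();
>     try {
>         // unprotect/remove snapshots and remove the image here
>     } finally {
>         lock.unlock();
>     }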
> 
> 
> To Reproduce (with ceph/rbd primary storage)
> 
> 1.  Set global concurrent.snapshots.threshold.perhost to the default NULL value
> 2.  Set global snapshot.poll.interval and storage.cleanup.interval to a low interval … 10 seconds
> 3.  Restart management server
> 4.  Deploy several VMs from templates
> 5.  Destroy+expunge the VMs after they are running
> 6.  Observe segfaults in management server
> 
> 
> Workaround
> 
> We’ve been able to eliminate the host agent segfaults in our testing by simply setting
> concurrent.snapshots.threshold.perhost to 1, even with the decreased poll intervals.
> 
> Segfault Logs
> 
> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d
> 
> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56
> 
> We would really appreciate any feedback and/or confirmation from the community around
> the above issues.  I’d also be happy to provide any additional information needed to get
> this addressed.

What seems to be happening is that it failed to unprotect the snapshot of the volume. This
could have various causes, for example a child image still attached to the snapshot. I don't
think that's the case here, however.

It could still be that it tries to remove the master/golden image of the template while
there are still children attached to that snapshot.
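
To make that concrete, a minimal sketch of the clone relationship using the rados-java
calls CloudStack already relies on (Rados/IoCTX setup, snapCreate/snapProtect, Rbd.clone);
the cluster address, key, pool and image names are placeholders, and the clone() arguments
follow how LibvirtStorageAdaptor uses them, from memory:

    import com.ceph.rados.IoCTX;
    import com.ceph.rados.Rados;
    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdException;
    import com.ceph.rbd.RbdImage;

    public class CloneChildExample {
        public static void main(String[] args) throws Exception {
            // Placeholder cluster credentials and pool name.
            Rados rados = new Rados("admin");
            rados.confSet("mon_host", "10.0.0.1");
            rados.confSet("key", "AQ...==");
            rados.connect();
            IoCTX io = rados.ioCtxCreate("cloudstack");

            Rbd rbd = new Rbd(io);
            RbdImage base = rbd.open("golden-image");

            // The template's base image gets a protected snapshot that volumes are cloned from.
            base.snapCreate("cloudstack-base-snap");
            base.snapProtect("cloudstack-base-snap");
            rbd.clone("golden-image", "cloudstack-base-snap", io, "volume-1", 1 /* layering */, 0);

            // As long as the child "volume-1" exists, unprotecting the base snapshot must
            // fail; librbd returns -EBUSY and rados-java should surface that as an
            // RbdException instead of taking the whole JVM down.
            try {
                base.snapUnprotect("cloudstack-base-snap");
            } catch (RbdException e) {
                System.out.println("Unprotect refused while a child exists: " + e.getMessage());
            }

            rbd.close(base);
            rados.ioCtxDestroy(io);
        }
    }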

I'm not sure if this is due to rados-java or a bug in librados. The Java code should just
throw an exception and not completely crash the JVM. The crash happens lower in the stack,
not in Java.

The assert shows this also happens while Java is talking to libvirt. My guess is a librados
bug, but I'm not completely sure.

Wido

> 
> — Aaron
