cloudstack-dev mailing list archives

From Aaron Hurt <ah...@ena.com>
Subject Ceph RBD related host agent segfault
Date Thu, 30 Jun 2016 16:29:55 GMT
While preparing to roll out a new platform built on 4.8 with a Ceph storage backend, we’ve been
encountering segfaults in the host agent that appear to be related to snapshot operations via
rados-java (librbd).  We’ve been able to isolate this to two possible places in the code:

lines ~866-875 in plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java

    for (RbdSnapInfo snap : snaps) {
        if (image.snapIsProtected(snap.name)) {
            s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
            image.snapUnprotect(snap.name);
        } else {
            s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
        }
        s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
        image.snapRemove(snap.name);
    }

Should we be checking whether the unprotect actually succeeded before attempting to remove
the snapshot?
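
As a sketch of what we have in mind (assuming rados-java’s snapUnprotect throws RbdException
on failure, which is how the surrounding code in LibvirtStorageAdaptor already handles these
calls, and reusing the names already in scope there):

    for (RbdSnapInfo snap : snaps) {
        String snapPath = pool.getSourceDir() + "/" + uuid + "@" + snap.name;
        if (image.snapIsProtected(snap.name)) {
            s_logger.debug("Unprotecting snapshot " + snapPath);
            try {
                image.snapUnprotect(snap.name);
            } catch (RbdException e) {
                // removing a still-protected snapshot will fail anyway, so
                // log the failure and leave it for a later cleanup pass
                s_logger.warn("Failed to unprotect snapshot " + snapPath, e);
                continue;
            }
        } else {
            s_logger.debug("Snapshot " + snapPath + " is not protected.");
        }
        s_logger.debug("Removing snapshot " + snapPath);
        image.snapRemove(snap.name);
    }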

Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230) duplicates some of this
functionality, and there doesn’t seem to be any locking that prevents deletePhysicalDisk and
the cleanup routine from running simultaneously.
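
To illustrate the kind of guard we mean, something as simple as a per-pool lock would keep the
two paths from interleaving.  This is only our sketch, not code from the tree; the class and
method names are ours:

    import java.util.concurrent.ConcurrentHashMap;

    public class RbdOperationGuard {
        // one lock object per storage pool UUID; computeIfAbsent keeps
        // the lookup itself thread-safe
        private static final ConcurrentHashMap<String, Object> POOL_LOCKS =
                new ConcurrentHashMap<>();

        public static void withPoolLock(String poolUuid, Runnable op) {
            Object lock = POOL_LOCKS.computeIfAbsent(poolUuid, k -> new Object());
            synchronized (lock) {
                op.run();
            }
        }
    }

Both deletePhysicalDisk and the periodic cleanup would then wrap their unprotect/remove
sequence in RbdOperationGuard.withPoolLock(pool.getUuid(), ...), so they could no longer run
against the same pool at the same time.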


To Reproduce (with Ceph/RBD primary storage)

1.  Set global concurrent.snapshots.threshold.perhost to the default NULL value
2.  Set global snapshot.poll.interval and storage.cleanup.interval to a low value, e.g. 10
seconds (see the cloudmonkey example after this list)
3.  Restart management server
4.  Deploy several VMs from templates
5.  Destroy+expunge the VMs after they are running
6.  Observe segfaults in the host agent
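
For completeness, steps 1 and 2 go through the updateConfiguration API; with cloudmonkey that
looks like the following (concurrent.snapshots.threshold.perhost is NULL out of the box, so we
just leave it unset):

    update configuration name=snapshot.poll.interval value=10
    update configuration name=storage.cleanup.interval value=10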


Workaround

In our testing we’ve been able to eliminate the host agent segfaults simply by setting
concurrent.snapshots.threshold.perhost to 1, even with the decreased poll intervals.
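
In effect that gates all snapshot work on a host behind a single permit.  The sketch below is
only our illustration of that behaviour, not the actual management server code:

    import java.util.concurrent.Semaphore;

    public class SnapshotThrottle {
        // one permit per host mimics concurrent.snapshots.threshold.perhost=1
        private final Semaphore permits = new Semaphore(1);

        public void runSnapshotJob(Runnable job) throws InterruptedException {
            permits.acquire();
            try {
                job.run(); // snapshot create/delete against the RBD image
            } finally {
                permits.release();
            }
        }
    }

That this serialization is enough to stop the crashes is what points us at a thread-safety
problem around the librbd calls rather than a logic bug.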

Segfault Logs

https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d

https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56

We would really appreciate any feedback and/or confirmation from the community on the above
issues.  I’d also be happy to provide any additional information needed to get this addressed.

— Aaron
