cloudstack-users mailing list archives

From cloudstack-fan <cloudstack-...@protonmail.com.INVALID>
Subject RE: Snapshots on KVM corrupting disk images
Date Sun, 03 Feb 2019 10:54:30 GMT
Yes, that's the scariest part: you almost never find out that an image is corrupted on the
same day. Usually a week or a fortnight passes before anyone notices the problem (and all
the old snapshots have already been removed by that time).

Some time ago I implemented a simple script that ran `qemu-img check` against each image on
a daily basis, but I had to give that idea up: `qemu-img check` usually reports a lot of errors
on a running instance's volume, so it only tells the truth when the instance is stopped. :-(
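
For what it's worth, here is a minimal sketch of that kind of daily check. The storage path
is just an example, and, as noted above, the results are only meaningful for volumes whose
instances are stopped:

#!/bin/bash
# Minimal sketch of a daily qcow2 consistency check (e.g. run from cron).
# STORAGE_DIR is an example path - point it at your primary storage mount.
# The results are only reliable for volumes whose instances are stopped.
STORAGE_DIR="/mnt/primary"

for image in "$STORAGE_DIR"/*; do
    # Only look at qcow2 volumes; skip templates, ISOs and other files.
    qemu-img info "$image" 2>/dev/null | grep -q 'file format: qcow2' || continue
    if ! qemu-img check -q "$image" 2>/dev/null; then
        echo "possible corruption: $image" | logger -t qcow2-check
    fi
done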

Here is a bit of advice.
1. First of all, never take a snapshot while the VM shows high I/O activity. I implemented
an SNMP agent that exposes the I/O activity of all VMs under a certain MIB, plus another
application that manages snapshots and only creates a new one when it's reasonably sure
that the VM isn't writing a lot of data to the storage. I'd gladly share it, but putting all
these pieces together is a bit tricky and I need some time to document it; a rough sketch of
the idea follows right after this list. Of course, you can always implement your own solution
for that. Maybe it would even be a nice idea to implement this in ACS itself. :)
2. Consider dropping the caches every hour (`/bin/echo 1 > /proc/sys/vm/drop_caches`). I've
found some correlation between image corruption and the page cache filling up.
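
To illustrate item 1, here is a rough sketch of the "only snapshot when the VM is quiet" check,
using virsh block statistics instead of my SNMP agent. The one-minute sampling window and the
10 MiB threshold are made-up examples, so tune them for your workloads:

#!/bin/bash
# Rough sketch: decide whether a VM is quiet enough to be snapshotted.
# Uses virsh block statistics instead of an SNMP agent; the one-minute
# window and the 10 MiB threshold below are arbitrary examples.
DOMAIN="$1"                          # libvirt domain name of the VM
THRESHOLD=$((10 * 1024 * 1024))      # bytes written during the window

bytes_written() {
    local total=0 dev value
    # Sum wr_bytes over all block devices of the domain.
    for dev in $(virsh domblklist "$DOMAIN" | awk 'NR > 2 && $1 != "" && $2 != "-" {print $1}'); do
        value=$(virsh domblkstat "$DOMAIN" "$dev" | awk '$2 == "wr_bytes" {print $3}')
        total=$((total + ${value:-0}))
    done
    echo "$total"
}

before=$(bytes_written)
sleep 60                             # sampling window
after=$(bytes_written)

if [ $((after - before)) -lt "$THRESHOLD" ]; then
    echo "$DOMAIN looks quiet, taking a snapshot now should be safer"
else
    echo "$DOMAIN is writing a lot, postponing the snapshot"
    exit 1
fi

As for item 2, a plain /etc/crontab entry like `0 * * * * root /bin/echo 1 > /proc/sys/vm/drop_caches`
takes care of the hourly cache drop on the hypervisor side.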

I'm still not 100% sure these measures can guarantee you calm sleep at night, but my statistics
(~600 VMs across different hosts, clusters, pods and zones) suggest that implementing them
was a correct step (knocking on wood, spitting over the left shoulder, etc.).

Good luck!


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, 1 February 2019 22:01, Sean Lair <slair@ippathways.com> wrote:

> Hello,
>
> We are using NFS storage. It is actually native NFS mounts on a NetApp storage system.
> We haven't seen those log entries, but we also don't always know when a VM gets corrupted...
> When we finally get a call that a VM is having issues, we've found that it was corrupted a
> while ago.
>
> -----Original Message-----
> From: cloudstack-fan [mailto:cloudstack-fan@protonmail.com.INVALID]
> Sent: Sunday, January 27, 2019 1:45 PM
> To: users@cloudstack.apache.org
> Cc: dev@cloudstack.apache.org
> Subject: Re: Snapshots on KVM corrupting disk images
>
> Hello Sean,
>
> It seems that you've encountered the same issue that I've been facing for the last
> 5-6 years of using ACS with KVM hosts (see this thread if you're interested in additional
> details: https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>
> I'd like to state that creating snapshots of a running virtual machine is a bit risky.
> I've implemented some workarounds in my environment, but I'm still not sure that they are
> 100% effective.
>
> I have a couple of questions, if you don't mind. What kind of storage do you use, if
> it's not a secret? Does your storage use XFS as a filesystem? Did you see something like
> this in your log files?
>
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>
> Did you see any unusual messages in your log file when the disaster happened?
>
> I hope things will be well. Wish you good luck and all the best!
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Tuesday, 22 January 2019 18:30, Sean Lair slair@ippathways.com wrote:
>
> > Hi all,
> > We had some instances where VM disks became corrupted when using KVM snapshots.
> > We are running CloudStack 4.9.3 with KVM on CentOS 7.
> > The first time was when someone mass-enabled scheduled snapshots on a large number of
> > VMs and secondary storage filled up. We had to restore all those VM disks... but we believed
> > it was just our fault for letting secondary storage fill up.
> > Today we had an instance where a snapshot failed and now the disk image is corrupted
> > and the VM can't boot. Here is the output of some commands:
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80':
> > Could not read snapshots: File too large
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80':
> > Could not read snapshots: File too large
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> >
> > We tried restoring to before the snapshot failure, but still have strange errors:
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > file format: qcow2
> > virtual size: 50G (53687091200 bytes)
> > disk size: 73G
> > cluster_size: 65536
> > Snapshot list:
> > ID TAG VM SIZE DATE VM CLOCK
> > 1 a8fdf99f-8219-4032-a9c8-87a6e09e7f95 3.7G 2018-12-23 11:01:43
> > 3099:35:55.242
> > 2 b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd 3.8G 2019-01-06 11:03:16
> > 3431:52:23.942
> > Format specific information:
> > compat: 1.1
> > lazy refcounts: false
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > tcmalloc: large alloc 1539750010880 bytes == (nil) @ 0x7fb9cbbf7bf3
> > 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e
> > 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
> > No errors were found on the image.
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img
> > snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > Snapshot list:
> > ID TAG VM SIZE DATE VM CLOCK
> > 1 a8fdf99f-8219-4032-a9c8-87a6e09e7f95 3.7G 2018-12-23 11:01:43
> > 3099:35:55.242
> > 2 b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd 3.8G 2019-01-06 11:03:16
> > 3431:52:23.942
> >
> > Everyone is now extremely hesitant to use snapshots in KVM... We tried deleting
> > the snapshots in the restored disk image, but it errors out...
> > Does anyone else have issues with KVM snapshots? We are considering just disabling
> > this functionality now...
> > Thanks
> > Sean


