cloudstack-dev mailing list archives

From cloudstack-fan <cloudstack-...@protonmail.com.INVALID>
Subject Re: Snapshots on KVM corrupting disk images
Date Sun, 14 Jul 2019 10:06:08 GMT
Dear colleagues,

Has anyone upgraded to 4.11.3 yet? This version includes a patch that should help avoid
this problem: https://github.com/apache/cloudstack/pull/3194. It would be great to know
whether it has helped you.
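
In case it is useful for comparing notes, one way to confirm which version a management
server is actually running (assuming the CloudMonkey CLI is configured against it; the
exact command and the grep below are just an illustration) is:

    cmk list capabilities | grep cloudstackversion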

Thanks in advance for sharing your experience.

Best regards,
a big CloudStack fan :)

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, 5 February 2019 12:25, cloudstack-fan <cloudstack-fan@protonmail.com> wrote:

> And one more thought, by the way.
>
> There's a cool new feature - asynchronous backup (https://cwiki.apache.org/confluence/display/CLOUDSTACK/Separate+creation+and+backup+operations+for+a+volume+snapshot).
> It allows creating a snapshot at one moment and backing it up at another. It would be
> amazing if it also offered the snapshot deletion procedure (I mean deletion from the
> primary storage) as a separate operation. Then I could check whether I/O activity is low
> before _deleting_ a snapshot from the primary storage, not only before _creating_ it;
> that could be a nice workaround.
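>
> A very rough sketch of the kind of check I have in mind is below (everything in it is a
> placeholder: the device name, the utilisation threshold and the way the deletion itself
> would be triggered; iostat comes from the sysstat package):
>
>     #!/bin/bash
>     # Hypothetical pre-deletion check: only proceed when the device backing the
>     # primary storage is reasonably idle. DEVICE and MAX_UTIL are placeholders.
>     DEVICE=sdb
>     MAX_UTIL=10
>
>     # Take one 10-second extended sample for the device and read its %util column.
>     UTIL=$(iostat -dxy "$DEVICE" 10 1 | awk -v d="$DEVICE" '$1 == d {u = $NF} END {print int(u)}')
>
>     if [ "$UTIL" -le "$MAX_UTIL" ]; then
>         echo "I/O activity is low (${UTIL}% util), deleting the snapshot from primary storage"
>         # ... trigger the (hoped-for) separate deletion operation here ...
>     else
>         echo "I/O activity is too high (${UTIL}% util), postponing the deletion"
>     fi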
>
> Dear colleagues, what do you think, is it doable?
>
> Thank you!
>
> Best regards,
> a big CloudStack fan :)
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, 4 February 2019 07:46, cloudstack-fan <cloudstack-fan@protonmail.com> wrote:
>
>> By the way, Red Hat also recommended suspending a VM before deleting a snapshot:
>> https://bugzilla.redhat.com/show_bug.cgi?id=920020. I'll quote it here:
>>
>>> 1. Pause the VM
>>>   2. Take an internal snapshot with the 'savevm' command of the qemu monitor
>>>      of the running VM, not with an external qemu-img process. virsh may or may
>>>      not provide an interface for this.
>>>   3. You can resume the VM now
>>>   4. qemu-img convert -f qcow2 -O qcow2 -s "$SNAPDATE" $i $i-snapshot
>>>   5. Pause the VM again
>>>   6. 'delvm' in the qemu monitor
>>>   7. Resume the VM
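>>
>> If I understand that procedure correctly, it would look roughly like this with virsh
>> (the domain name, snapshot name and image paths are placeholders; `virsh
>> qemu-monitor-command --hmp` is one way to reach the qemu monitor of a libvirt-managed
>> VM, though it bypasses libvirt's own snapshot bookkeeping):
>>
>>     # 1-3: pause the VM, take an internal snapshot via the qemu monitor, resume
>>     virsh suspend myvm
>>     virsh qemu-monitor-command --hmp myvm 'savevm snap1'
>>     virsh resume myvm
>>
>>     # 4: copy the snapshot's consistent state out into a separate file
>>     qemu-img convert -f qcow2 -O qcow2 -s snap1 /path/to/disk.qcow2 /path/to/disk-snap1.qcow2
>>
>>     # 5-7: pause again, drop the internal snapshot, resume
>>     virsh suspend myvm
>>     virsh qemu-monitor-command --hmp myvm 'delvm snap1'
>>     virsh resume myvm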
>>
>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Monday, 4 February 2019 07:36, cloudstack-fan <cloudstack-fan@protonmail.com> wrote:
>>
>>> I'd also like to add another detail, if no one minds.
>>>
>>> Sometimes one can run into this issue without shutting down a VM. The disaster might
>>> occur right after a snapshot is copied to the secondary storage and deleted from the
>>> VM's image on the primary storage. I saw it a couple of times, when it happened to VMs
>>> that were being monitored. The monitoring suite showed that these VMs were working fine
>>> right until the final phase (apart from a short pause during the snapshot creation stage).
>>>
>>> I also noticed that a VM is always suspended while a snapshot is being created
>>> (`virsh list` shows it in the "paused" state), but when a snapshot is being deleted
>>> from the image the same command always shows the "running" state, although the VM
>>> doesn't respond to anything during the snapshot deletion phase.
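>>>
>>> (If anyone wants to observe this themselves, something as simple as the following on
>>> the host during the snapshot operations shows the transitions; "myvm" is a placeholder
>>> for the instance name reported by `virsh list`.)
>>>
>>>     # refresh the domain state every second while the snapshot is created/deleted
>>>     watch -n 1 virsh domstate myvm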
>>>
>>> It seems to be a bug in KVM/QEMU itself, I think. Proxmox users face the same issue
>>> (see https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/,
>>> https://forum.proxmox.com/threads/proxmox-3-4-11qcow2-image-is-corrupt.25953/ and other
>>> similar threads), but it would also be great to implement some workaround in ACS. Maybe,
>>> just as you proposed, it would be wise to suspend the VM before snapshot deletion and
>>> resume it afterwards. That would give ACS a serious advantage over other orchestration
>>> systems. :-)
>>>
>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>> On Friday, 1 February 2019 22:25, Ivan Kudryavtsev <kudryavtsev_ia@bw-sw.com> wrote:
>>>
>>>> Yes, the image is corrupted only after the VM shutdown.
>>>>
>>>> Fri, 1 Feb 2019, 15:01, Sean Lair <slair@ippathways.com>:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are using NFS storage. It is actually native NFS mounts on a NetApp storage
>>>>> system. We haven't seen those log entries, but we also don't always know when a VM
>>>>> gets corrupted... When we finally get a call that a VM is having issues, we've found
>>>>> that it was corrupted a while ago.
>>>>>
>>>>> -----Original Message-----
>>>>> From: cloudstack-fan [mailto:cloudstack-fan@protonmail.com.INVALID]
>>>>> Sent: Sunday, January 27, 2019 1:45 PM
>>>>> To: users@cloudstack.apache.org
>>>>> Cc: dev@cloudstack.apache.org
>>>>> Subject: Re: Snapshots on KVM corrupting disk images
>>>>>
>>>>> Hello Sean,
>>>>>
>>>>> It seems that you've encountered the same issue that I've been facing for the last
>>>>> 5-6 years of using ACS with KVM hosts (see this thread if you're interested in
>>>>> additional details: https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>>>>>
>>>>> I'd like to state that creating snapshots of a running virtual machine is a bit
>>>>> risky. I've implemented some workarounds in my environment, but I'm still not sure
>>>>> that they are 100% effective.
>>>>>
>>>>> I have a couple of questions, if you don't mind. What kind of storage do you use,
>>>>> if it's not a secret? Does your storage use XFS as a filesystem? Did you see something
>>>>> like this in your log files?
>>>>>
>>>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>>>
>>>>> Did you see any unusual messages in your log file when the disaster happened?
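>>>>>
>>>>> A quick way to look for these messages on a host, assuming they end up in the kernel
>>>>> ring buffer or the systemd journal, would be something like:
>>>>>
>>>>>     # search the kernel ring buffer (human-readable timestamps)
>>>>>     dmesg -T | grep -i 'possible memory allocation deadlock'
>>>>>     # or the kernel messages kept in the systemd journal
>>>>>     journalctl -k | grep -i 'possible memory allocation deadlock'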
>>>>>
>>>>> I hope things will be well. I wish you good luck and all the best!
>>>>>
>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>> On Tuesday, 22 January 2019 18:30, Sean Lair <slair@ippathways.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We had some instances where VM disks became corrupted when using KVM snapshots.
>>>>>> We are running CloudStack 4.9.3 with KVM on CentOS 7.
>>>>>>
>>>>>> The first time was when someone mass-enabled scheduled snapshots on a large number
>>>>>> of VMs and secondary storage filled up. We had to restore all those VM disks... but
>>>>>> we believed it was just our fault for letting secondary storage fill up.
>>>>>>
>>>>>> Today we had an instance where a snapshot failed and now the disk image is corrupted
>>>>>> and the VM can't boot. Here is the output of some commands:
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> We tried restoring to a point before the snapshot failure, but we still see strange errors:
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> file format: qcow2
>>>>>> virtual size: 50G (53687091200 bytes)
>>>>>> disk size: 73G
>>>>>> cluster_size: 65536
>>>>>> Snapshot list:
>>>>>> ID  TAG                                   VM SIZE  DATE                 VM CLOCK
>>>>>> 1   a8fdf99f-8219-4032-a9c8-87a6e09e7f95     3.7G  2018-12-23 11:01:43  3099:35:55.242
>>>>>> 2   b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd     3.8G  2019-01-06 11:03:16  3431:52:23.942
>>>>>> Format specific information:
>>>>>>     compat: 1.1
>>>>>>     lazy refcounts: false
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> tcmalloc: large alloc 1539750010880 bytes == (nil) @ 0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
>>>>>> No errors were found on the image.
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> Snapshot list:
>>>>>> ID  TAG                                   VM SIZE  DATE                 VM CLOCK
>>>>>> 1   a8fdf99f-8219-4032-a9c8-87a6e09e7f95     3.7G  2018-12-23 11:01:43  3099:35:55.242
>>>>>> 2   b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd     3.8G  2019-01-06 11:03:16  3431:52:23.942
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> Everyone is now extremely hesitant to use snapshots in KVM... We tried deleting
>>>>>> the snapshots in the restored disk image, but it errors out...
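>>>>>>
>>>>>> (For reference, deleting an internal qcow2 snapshot by its tag would normally be
>>>>>> done with `qemu-img snapshot -d`, using the tags from the listing above, for example:
>>>>>>
>>>>>>     # remove the two leftover internal snapshots from the restored image
>>>>>>     qemu-img snapshot -d a8fdf99f-8219-4032-a9c8-87a6e09e7f95 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>>     qemu-img snapshot -d b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80)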
>>>>>>
>>>>>> Does anyone else have issues with KVM snapshots? We are considering just
>>>>>> disabling this functionality now...
>>>>>>
>>>>>> Thanks
>>>>>> Sean