Bug #58707

open

rbd snapshot corruption, likely due to snaptrim

Added by Roel van Meer about 1 year ago. Updated 2 months ago.

Status: Need More Info
Priority: Normal
Target version: -
% Done: 0%
Source: Community (user)
Tags: rbd snaptrim
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Dear maintainers,

We have one Ceph pool where rbd snapshots are being corrupted. This happens within hours of the snapshot creation. Corruption does not happen when snaptrim is disabled.

We're using Proxmox 7 with their Ceph packages. The problem existed with Ceph Octopus (15.2.17), and still exists with Pacific (16.2.9).

The corruption was first detected by VMs not being able to boot, or displaying severe filesystem corruption after reverting to a snapshot, and later confirmed by doing a sha256sum on a mapped snapshot periodically. Comparison of a copy of the original snapshot and a copy of a corrupted snapshot showed that several 4MB chunks that had data in the original snapshot had only zeroes in the corrupted snapshot.
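
A minimal sketch of such a chunk-wise comparison, assuming both copies exist as regular files of equal size (for example, exported images). The function names `nonzero_bytes` and `zeroed_chunks` are hypothetical, and this is not necessarily the procedure that was used for the comparison described above. It prints the index of each chunk that holds data in the original copy but contains only zeroes in the suspect copy.

```shell
# Count the non-zero bytes in chunk number $3 (chunk size $2 bytes) of file $1.
nonzero_bytes() {
    dd if="$1" bs="$2" skip="$3" count=1 2>/dev/null | tr -d '\000' | wc -c
}

# Print the index of every chunk that holds data in ORIGINAL but is all
# zeroes in SUSPECT. Usage: zeroed_chunks ORIGINAL SUSPECT [CHUNK_BYTES]
zeroed_chunks() {
    zc_orig=$1
    zc_suspect=$2
    zc_chunk=${3:-4194304}          # default to 4 MiB chunks
    zc_size=$(wc -c < "$zc_orig")   # size of the original copy in bytes
    zc_i=0
    while [ $((zc_i * zc_chunk)) -lt "$zc_size" ]; do
        if [ "$(nonzero_bytes "$zc_orig" "$zc_chunk" "$zc_i")" -gt 0 ] &&
           [ "$(nonzero_bytes "$zc_suspect" "$zc_chunk" "$zc_i")" -eq 0 ]; then
            echo "$zc_i"
        fi
        zc_i=$((zc_i + 1))
    done
}
```

Multiplying a reported index by the chunk size gives the byte offset of the zeroed chunk within the image.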

This is reproducible for us, but only on the one pool of one Ceph cluster (we manage several).

Our reproduction is simple:

- Create a VM with a 10GB disk in Proxmox
- Do a standard Debian install
- Shut down the VM
- Create a snapshot
- Start the VM

Then run a script like this every 10 minutes:

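# Map the snapshot, checksum the mapped device, log the result, then unmap.
dev=$(rbd -p ssd map vm-9707-disk-0@snapshot) || exit 1
sum=$(sha256sum "$dev" | cut -d' ' -f1)   # keep only the hash, drop the device path
logger "Checksum of vm-9707-disk-0@snapshot is $sum"
rbd -p ssd unmap vm-9707-disk-0@snapshot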

Result: The logged checksum changes. Sometimes the change happens after one or two hours, sometimes it has already changed after 10 minutes.
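
To pick the moments of change out of a long log, something like the following could be used. It is only a sketch: the function name `checksum_changes` is hypothetical, and it assumes each log line contains the text "Checksum of ... is" followed by the checksum as the next whitespace-separated field, as produced by the logger call above.

```shell
# Print every "Checksum of ... is <sum>" line whose checksum differs from the
# one logged just before it. Reads log lines on standard input.
checksum_changes() {
    awk '
        /Checksum of / {
            sum = ""
            # The checksum is the field right after the word "is".
            for (i = 1; i < NF; i++) if ($i == "is") { sum = $(i + 1); break }
            if (sum == "") next
            if (prev != "" && sum != prev) print
            prev = sum
        }
    '
}
```

Feed it the log that receives the logger messages on standard input; only the lines where the checksum changed are printed, so the timestamps of the corruption events stand out.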

While running Ceph with 'nosnaptrim' for about 8 hours, no changes occurred in the snapshot's checksum.

The cluster health is OK, and I have not seen any messages in the logs that look strange to me, although we might need to run with a higher logging level to get any useful information.

This is a production cluster, so I'm limited in what I can do here, but I'm happy and willing to provide any and all information necessary to help figure this out.


Files

ceph-df.txt (736 Bytes): ceph df output. Roel van Meer, 02/13/2023 01:50 PM
ceph-status.txt (855 Bytes): ceph status output. Roel van Meer, 02/13/2023 01:50 PM
rbd-du.txt (35.3 KB): rbd -p ssd du - output. Roel van Meer, 02/14/2023 07:05 AM
snapshot-checksum.txt (78.8 KB): Checksum of vm-9707-disk-0@snapshot over time. Roel van Meer, 02/14/2023 07:05 AM
rbd-info-vm-9707-disk-0.txt (406 Bytes): Output of: rbd -p ssd info vm-9707-disk-0. Roel van Meer, 02/14/2023 10:56 PM
rbd-info-vm-9707-disk-0_snapshot.txt (424 Bytes): Output of: rbd -p ssd info vm-9707-disk-0@snapshot. Roel van Meer, 02/14/2023 10:56 PM
9718-info.txt (856 Bytes). Roel van Meer, 02/17/2023 07:21 AM
pve09-journal.txt (615 KB): Kernel log from the hypervisor. Roel van Meer, 02/23/2023 09:48 PM
9721-info.txt (91.5 KB): System info on reproducer 9721. Roel van Meer, 02/23/2023 10:05 PM
roel-disk-4-snapshot-create.log (103 KB): Debug log of rbd snap create of roel-disk-4@snapshot. Roel van Meer, 03/13/2023 01:09 PM
test-NBD-roel-disk-6.txt (3.88 KB): Documented NBD test. Roel van Meer, 03/14/2023 03:00 PM
roel-disk-8.tar.gz (5.11 KB): Output and scripts of per-block test with listsnaps output. Roel van Meer, 03/14/2023 04:23 PM
ceph-osd-pool-ls-detail.txt (2.06 KB). Roel van Meer, 03/17/2023 01:15 PM
missing-blocks-pgs.txt (97.3 KB). Roel van Meer, 03/20/2023 04:02 PM
ceph-osd.56.log-20230404.txt.gz (130 KB): OSD 56 log. Roel van Meer, 04/04/2023 06:38 AM
ceph-osd.56.log.txt.gz (159 KB): OSD 56 log with additional snapmapper debug. Roel van Meer, 04/05/2023 07:00 AM
purged-snaps-omap.tar.gz (218 KB). Roel van Meer, 04/25/2023 01:45 PM