Bug #48999

Data corruption with rbd_balance_parent_reads and rbd_balance_snap_reads set to true.

Added by chao guo almost 2 years ago. Updated about 1 year ago.

Status: New
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source: Community (user)
Tags: librbd, corruption, data loss, snapshot, clone, rbd_balance_parent_reads, rbd_balance_snap_reads
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: rbd
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We use librbd as VM storage and have found an issue that is very likely to cause data corruption on snapshots and cloned RBD images when the following options are set:
rbd_balance_parent_reads = true
rbd_balance_snap_reads = true

Steps to reproduce this bug:
1. Create an RBD image by cloning from a snapshot (I do not know whether cloning is necessary).
2. Take many snapshots of the cloned image (7 snapshots are enough to trigger this bug, although not every time). Create one clone image for each snapshot.
3. Use a simple Python script or any other method (with the image opened read-only) to read from the clone of the last snapshot, and keep reading without stopping until the end.
4. Clone an image and immediately start a VM on it, and at the same time delete all the snapshots and their clone images except the last one. (I do this from a script, even though there is a file lock between the delete and the clone.)
5. The VM cannot start up because the OS image is corrupted.
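
For anyone trying to reproduce step 3, a reader along the following lines may be enough. This is only a minimal sketch and not the script used in the original report: the function names, the 4 MiB chunk size, the SHA-256 checksum, and the default conffile path are all assumptions made for illustration. It relies on the `rados` and `rbd` Python bindings being installed and opens the image read-only.

```python
import hashlib


def read_image_fully(image, chunk_size=4 * 1024 * 1024):
    """Read an opened image sequentially from offset 0 to its end.

    `image` only needs size() and read(offset, length), which rbd.Image
    provides. Returns (bytes_read, sha256_hex) so that repeated runs can
    be compared. The checksum is an illustrative addition, not part of
    the original report.
    """
    size = image.size()
    digest = hashlib.sha256()
    offset = 0
    while offset < size:
        length = min(chunk_size, size - offset)
        data = image.read(offset, length)
        if not data:
            break  # unexpected empty read; stop instead of looping forever
        digest.update(data)
        offset += len(data)
    return offset, digest.hexdigest()


def main(pool_name, image_name, conffile="/etc/ceph/ceph.conf"):
    # Imported here so read_image_fully() can be used without a cluster.
    import rados
    import rbd

    with rados.Rados(conffile=conffile) as cluster:
        with cluster.open_ioctx(pool_name) as ioctx:
            # read_only=True matches "image is opened readonly" in step 3.
            with rbd.Image(ioctx, image_name, read_only=True) as image:
                total, checksum = read_image_fully(image)
                print(f"{image_name}: read {total} bytes, sha256={checksum}")
```

Calling `main("<pool>", "<clone of the last snapshot>")` in a loop while step 4 runs keeps a read-only reader active. If the printed checksum differs between runs of an image that has not been written to, that would be consistent with the corruption described here.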

The last snapshot seems to be corrupted too, because any image cloned from this snapshot is corrupted.

Workaround:
Set these options to false, and the bug goes away:
rbd_balance_parent_reads = false
rbd_balance_snap_reads = false
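
If editing ceph.conf on every client is inconvenient, the same two overrides can probably be applied per client from Python. The sketch below is an assumption-laden illustration, not a tested fix: `WORKAROUND_OVERRIDES`, `render_client_section`, and `connect_with_workaround` are hypothetical names, and it assumes that librbd honours options passed through the `conf` argument of `rados.Rados` when images are later opened on that cluster handle.

```python
# The two options from the workaround, as string values.
WORKAROUND_OVERRIDES = {
    "rbd_balance_parent_reads": "false",
    "rbd_balance_snap_reads": "false",
}


def render_client_section(overrides):
    """Render the overrides as a [client] section for ceph.conf."""
    lines = ["[client]"]
    lines += [f"{key} = {value}" for key, value in sorted(overrides.items())]
    return "\n".join(lines) + "\n"


def connect_with_workaround(conffile="/etc/ceph/ceph.conf"):
    """Connect with the overrides applied on top of the config file.

    Assumption: options passed via `conf` are set after the file is read,
    so they take precedence for this client only.
    """
    import rados  # imported lazily; requires the Ceph Python bindings

    cluster = rados.Rados(conffile=conffile, conf=dict(WORKAROUND_OVERRIDES))
    cluster.connect()
    return cluster
```

`print(render_client_section(WORKAROUND_OVERRIDES))` produces a fragment that can be pasted into ceph.conf. `connect_with_workaround()` is the per-process alternative, and the caller is responsible for calling `shutdown()` on the returned handle.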

I am using versions 13.2.9 and 13.2.10. I do not know whether this bug affects any other version.

History

#1 Updated by Greg Farnum over 1 year ago

  • Project changed from Ceph to rbd
  • Category deleted (librbd)

Not sure if this is an issue with rbd snap handling or with RADOS or with user expectations around changing snapshots with read balancing?

#2 Updated by Ilya Dryomov about 1 year ago

Hi Chao,

The Mimic (13.2.*) release has been EOL for over a year (the last stable release was cut in April 2020). I would encourage you to upgrade to a supported release, Octopus (currently 15.2.14) or Pacific (currently 16.2.6), and see if the issue still exists. The read-from-replica logic was reworked in Octopus (https://github.com/ceph/ceph/pull/32381), so it is likely fixed.
