Bug #61616

closed

[librbd] volume data corruption when using rbd-mirror w/failover

Added by Ilya Dryomov 11 months ago. Updated 6 months ago.

Status: Resolved
Priority: Urgent
Assignee: Ilya Dryomov
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: pacific,quincy,reef
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 52109
Crash signature (v1):
Crash signature (v2):

Description

This bug creates holes (zeros in this case) in the image where there should be data, causing corruption at both the block and the filesystem level. A reproducer was created to illustrate the steps that trigger the problem and to show the associated symptoms.
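One way to look for such holes, sketched here under assumptions: given two regular files holding copies of the image (for example, exports taken from each site), report every block that is all zeros in the suspect copy but holds data in the known-good copy. The function names, the 4096-byte default block size, and the idea of comparing exported copies are illustrative and are not part of the reproducer.

```sh
#!/bin/sh
# Hypothetical helper, not part of the attached scripts.
# Works on regular files (e.g. exported image copies), not on block devices,
# because the size is taken with wc -c.

# Succeeds if block number $2 (of size $3 bytes) in file $1 has no non-zero byte.
# A block that lies past the end of the file also counts as zero.
is_zero_block() {
    n=$(dd if="$1" bs="$3" skip="$2" count=1 2>/dev/null | tr -d '\000' | wc -c)
    [ "$n" -eq 0 ]
}

# Usage: find_zero_holes GOOD SUSPECT [BLOCK_SIZE]
# Prints the index of each block that is zeroed in SUSPECT but not in GOOD.
find_zero_holes() {
    good=$1
    suspect=$2
    bs=${3:-4096}
    size=$(wc -c < "$good")
    blocks=$(( (size + bs - 1) / bs ))
    b=0
    while [ "$b" -lt "$blocks" ]; do
        if is_zero_block "$suspect" "$b" "$bs" && ! is_zero_block "$good" "$b" "$bs"; then
            echo "$b"
        fi
        b=$((b + 1))
    done
}
```

The scan reads one block per dd call, so it is slow on large images; it is meant only to make the symptom concrete.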

The environment:

There are two sites, site-a and site-b. To start the scenario, an image is created with mirroring enabled.

Here's how we arrive at the bug:

1. Map the image on the primary.
2. Format, then mount the block device.
3a. Start manually creating 40 mirror image snapshots, one every 3 seconds. When this step completes, it ends step 3b.  
3b. Untar via the provided script, with a timeout of 5m.
4. Unmap, then demote the image on the primary.
5. Wait for the demoted image to be synced to the secondary.
6. Promote the image on the secondary.
7. Map the image on the secondary.
8. Run fsck -fn on the image.
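The sequence can be sketched as a dry run that only prints the commands, in order. This is not the attached reproducer: the pool and image names, the device and mount point, the use of --cluster site-a / site-b to address the two sites, and mkfs.ext4 (inferred from the e2fsck output) are all assumptions. The arguments to kernel_untar.sh are not given in the report and are left out. Steps 3a and 3b overlap in the real scenario; the dry run simply lists them one after the other.

```sh
#!/bin/sh
# Dry-run sketch of the failover sequence: prints each command, runs nothing.
# POOL, IMAGE, DEV, MNT and the --cluster names are illustrative assumptions.
POOL=${POOL:-mirror_pool}
IMAGE=${IMAGE:-test_image}
DEV=${DEV:-/dev/rbd0}
MNT=${MNT:-/mnt/test}

# Print a command instead of executing it.
run() { echo "+ $*"; }

failover_plan() {
    spec="$POOL/$IMAGE"
    snaps=${SNAPS:-40}

    run rbd --cluster site-a map "$spec"                     # step 1
    run mkfs.ext4 "$DEV"                                     # step 2 (ext4 assumed)
    run mount "$DEV" "$MNT"

    # Step 3b: untar under a 5 minute timeout (script arguments unknown).
    run timeout 5m sh ./kernel_untar.sh
    # Step 3a: one mirror snapshot every 3 seconds, 40 by default.
    i=1
    while [ "$i" -le "$snaps" ]; do
        run rbd --cluster site-a mirror image snapshot "$spec"
        run sleep 3
        i=$((i + 1))
    done

    run umount "$MNT"                                        # step 4
    run rbd unmap "$DEV"
    run rbd --cluster site-a mirror image demote "$spec"
    # Step 5: poll the status until the demotion has been synced; the exact
    # condition to wait for is not spelled out in the report.
    run rbd --cluster site-b mirror image status "$spec"
    run rbd --cluster site-b mirror image promote "$spec"    # step 6
    run rbd --cluster site-b map "$spec"                     # step 7
    run fsck -fn "$DEV"                                      # step 8
}
```

Calling failover_plan prints the plan; SNAPS=2 failover_plan shortens the snapshot loop for a quick look. Most of these commands would need root privileges if executed for real.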

After running the steps above, data corruption occurs.
It shows up when running fsck on the volume:

+ sudo fsck -fn /dev/rbd0
fsck from util-linux 2.38.1
e2fsck 1.46.5 (30-Dec-2021)
Pass 1: Checking inodes, blocks, and sizes
HTREE directory inode 83315 has an invalid root node.
Clear HTree index? no

HTREE directory inode 83315 has an unsupported hash version (20)
Clear HTree index? no

HTREE directory inode 83315 uses an incompatible htree root node flag.
Clear HTree index? no
...
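Because fsck is run with -n, nothing is repaired, and a script wrapping it has to rely on the exit status. As a small sketch, the helper below decodes that status using the bit flags documented in fsck(8); the function name and the flag labels are made up for illustration.

```sh
#!/bin/sh
# Decode an fsck exit status, which fsck(8) documents as a sum of bit flags.
# The function name and the printed labels are illustrative.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no-errors"
        return 0
    fi
    flags=""
    [ $((status & 1)) -ne 0 ]   && flags="$flags errors-corrected"
    [ $((status & 2)) -ne 0 ]   && flags="$flags reboot-needed"
    [ $((status & 4)) -ne 0 ]   && flags="$flags errors-left-uncorrected"
    [ $((status & 8)) -ne 0 ]   && flags="$flags operational-error"
    [ $((status & 16)) -ne 0 ]  && flags="$flags usage-error"
    [ $((status & 32)) -ne 0 ]  && flags="$flags canceled"
    [ $((status & 128)) -ne 0 ] && flags="$flags shared-library-error"
    # Strip the leading space left by the first appended flag.
    echo "${flags# }"
}
```

For example, running sudo fsck -fn /dev/rbd0 followed by describe_fsck_status $? would be expected to report errors-left-uncorrected on a corrupted volume like the one shown, since -n answers "no" to every repair prompt.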

Steps to reproduce:
1. In the build directory of a clone of upstream Ceph, run:

sh ./rbd_reproducer_corrupt1.sh

2. The script runs until a command returns an error code (most likely fsck).


Files

kernel_untar.sh (3.9 KB), Christopher Hoffman, 05/26/2023 04:44 PM
mirrorenv.sh (4.66 KB), Christopher Hoffman, 05/26/2023 04:44 PM
rbd_reproducer_corrupt1.sh (2.64 KB), Christopher Hoffman, 05/26/2023 04:44 PM
may25-26-corrupt1 (17.1 KB), full fsck output, Christopher Hoffman, 05/26/2023 04:53 PM
demote-promote snap fsck.txt (5.77 KB), Christopher Hoffman, 06/01/2023 09:13 PM
rbd_reproducer_corrupt2.sh (3.04 KB), Christopher Hoffman, 06/01/2023 09:20 PM
keep-mirror-snaps.patch (3.14 KB), Christopher Hoffman, 06/01/2023 09:51 PM

Related issues 4 (0 open, 4 closed)

Copied from Linux kernel client - Bug #61472: [krbd] volume data corruption when using rbd-mirror w/failover (Resolved, Ilya Dryomov)
Copied to rbd - Backport #61750: pacific: [librbd] volume data corruption when using rbd-mirror w/failover (Resolved, Ilya Dryomov)
Copied to rbd - Backport #61751: reef: [librbd] volume data corruption when using rbd-mirror w/failover (Resolved, Ilya Dryomov)
Copied to rbd - Backport #61752: quincy: [librbd] volume data corruption when using rbd-mirror w/failover (Resolved, Ilya Dryomov)
Actions #1

Updated by Ilya Dryomov 11 months ago

  • Copied from Bug #61472: [krbd] volume data corruption when using rbd-mirror w/failover added
Actions #2

Updated by Ilya Dryomov 11 months ago

Forked from https://tracker.ceph.com/issues/61472 to track fixing librbd, as librbd is similarly affected (can be reproduced using the same script with rbd-nbd).

Actions #3

Updated by Ilya Dryomov 11 months ago

  • Status changed from New to In Progress
  • Assignee set to Ilya Dryomov
  • Backport set to pacific,quincy,reef
Actions #4

Updated by Ilya Dryomov 11 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 52109
Actions #5

Updated by Ilya Dryomov 11 months ago

  • Regression changed from No to Yes
Actions #6

Updated by Ilya Dryomov 11 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #7

Updated by Backport Bot 11 months ago

  • Copied to Backport #61750: pacific: [librbd] volume data corruption when using rbd-mirror w/failover added
Actions #8

Updated by Backport Bot 11 months ago

  • Copied to Backport #61751: reef: [librbd] volume data corruption when using rbd-mirror w/failover added
Actions #9

Updated by Backport Bot 11 months ago

  • Copied to Backport #61752: quincy: [librbd] volume data corruption when using rbd-mirror w/failover added
Actions #10

Updated by Backport Bot 11 months ago

  • Tags set to backport_processed
Actions #11

Updated by Ilya Dryomov 6 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
