Bug #55803

closed

[rbd-mirror] primary snapshot in-use by replayer can be unlinked and removed

Added by Ilya Dryomov almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags:
Backport: octopus,pacific,quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

CreatePrimaryRequest::unlink_peer(), invoked on the primary cluster via the "rbd mirror image snapshot" command or by the rbd_support mgr module when it creates a new scheduled mirror snapshot at rbd_mirroring_max_mirroring_snapshots capacity, can race with Replayer::unlink_peer(), invoked by rbd-mirror on the secondary cluster when it finishes syncing an older snapshot. Consider the following:

   [ primary: primary-snap1, primary-snap2, primary-snap3
     secondary: non-primary-snap1 (complete), non-primary-snap2 (syncing) ]

0. rbd-mirror is syncing snap1..snap2 delta
1. rbd_support creates primary-snap4
2. due to rbd_mirroring_max_mirroring_snapshots == 3, rbd_support picks primary-snap3 for unlinking
3. rbd-mirror finishes syncing snap1..snap2 delta and marks non-primary-snap2 complete

   [ snap1 (the old base) is no longer needed on either cluster ]

4. rbd-mirror unlinks and removes primary-snap1
5. rbd-mirror removes non-primary-snap1
6. rbd-mirror picks snap2 as the new base
7. rbd-mirror creates non-primary-snap3 and starts syncing snap2..snap3 delta

   [ primary: primary-snap2, primary-snap3, primary-snap4
     secondary: non-primary-snap2 (complete), non-primary-snap3 (syncing) ]

8. rbd_support unlinks and removes primary-snap3 which is in-use by rbd-mirror
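
The interleaving above can be replayed with a toy model. This is only a sketch: it uses plain Python lists and dicts rather than any Ceph API, the function name simulate_race and the variables victim and in_use are hypothetical, and the "third oldest" pick is taken from the description of the commit referenced in this report.

```python
def simulate_race(max_snaps=3):
    """Replay steps 0-8 against a toy model of both clusters (not Ceph code)."""
    # Step 0: initial state, rbd-mirror is syncing the snap1..snap2 delta.
    primary = ["snap1", "snap2", "snap3"]
    secondary = {"snap1": "complete", "snap2": "syncing"}

    # Step 1: the snapshot creator adds primary-snap4.
    primary.append("snap4")
    # Step 2: it decides which snapshot to unlink from its current view
    # (assumption: the third oldest, i.e. index 2). The decision is not
    # re-validated later, which is what makes the race possible.
    victim = primary[2] if len(primary) > max_snaps else None

    # Step 3: the replayer finishes the snap1..snap2 delta.
    secondary["snap2"] = "complete"
    # Steps 4-5: the old base is removed on both clusters.
    primary.remove("snap1")
    del secondary["snap1"]
    # Steps 6-7: snap2 becomes the base and the next snapshot starts syncing.
    base = "snap2"
    in_use = primary[primary.index(base) + 1]
    secondary[in_use] = "syncing"

    # Step 8: the creator acts on its stale decision.
    primary.remove(victim)
    return victim, in_use, primary, secondary


victim, in_use, primary, secondary = simulate_race()
print(victim == in_use)  # the removed snapshot is the one being synced
```

In this model the snapshot chosen in step 2 is only harmless as long as the replayer does not advance before step 8.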

If snap trimming on the primary cluster kicks in soon enough, the secondary image becomes corrupted: the primary cluster OSDs would start returning ENOENT for snap trimmed objects, so rbd-mirror would eventually finish "syncing" non-primary-snap3 and mark it complete in spite of bogus data in the HEAD. Luckily, rbd-mirror's attempt to pick snap3 as the new base would wedge the replayer with a "split-brain detected: failed to find matching non-primary snapshot in remote image" error:

2022-05-31T09:05:32.317-0400 7fb191c04700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55cade6f6000 scan_local_mirror_snapshots:                                                      
2022-05-31T09:05:32.317-0400 7fb191c04700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55cade6f6000 scan_local_mirror_snapshots: local mirror snapshot: id=7, mirror_ns=[mirror state=non-primary, complete=1, mirror_peer_uuids=, primary_mirror_uuid=e1260ffa-0678-4a98-a264-4d7da43b071c, primary_snap_id=7, last_copied_object_number=5120, snap_seqs={7=18446744073709551614}]                                                                                                                                                                                                                                                      
2022-05-31T09:05:32.317-0400 7fb191c04700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55cade6f6000 scan_local_mirror_snapshots: found local mirror snapshot: local_snap_id_start=7, local_snap_id_end=18446744073709551614, local_snap_ns=[mirror state=non-primary, complete=1, mirror_peer_uuids=, primary_mirror_uuid=e1260ffa-0678-4a98-a264-4d7da43b071c, primary_snap_id=7, last_copied_object_number=5120, snap_seqs={7=18446744073709551614}]                                                                                             
2022-05-31T09:05:32.317-0400 7fb191c04700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55cade6f6000 scan_remote_mirror_snapshots:
2022-05-31T09:05:32.317-0400 7fb191c04700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55cade6f6000 scan_remote_mirror_snapshots: remote mirror snapshot: id=8, mirror_ns=[mirror state=primary, complete=1, mirror_peer_uuids=59ce4f20-5cf7-4c9e-a2d9-ad769c7c8a6d, clean_since_snap_id=head]
2022-05-31T09:05:32.317-0400 7fb191c04700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55cade6f6000 scan_remote_mirror_snapshots: remote mirror snapshot: id=9, mirror_ns=[mirror state=primary, complete=1, mirror_peer_uuids=59ce4f20-5cf7-4c9e-a2d9-ad769c7c8a6d, clean_since_snap_id=head]
2022-05-31T09:05:32.317-0400 7fb191c04700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55cade6f6000 scan_remote_mirror_snapshots: remote mirror snapshot: id=11, mirror_ns=[mirror state=primary, complete=1, mirror_peer_uuids=59ce4f20-5cf7-4c9e-a2d9-ad769c7c8a6d, clean_since_snap_id=head]
2022-05-31T09:05:32.317-0400 7fb191c04700 -1 rbd::mirror::image_replayer::snapshot::Replayer: 0x55cade6f6000 scan_remote_mirror_snapshots: failed to locate remote start snapshot: snap_id=7                                                              
2022-05-31T09:05:32.317-0400 7fb191c04700 -1 rbd::mirror::image_replayer::snapshot::Replayer: 0x55cade6f6000 scan_remote_mirror_snapshots: split-brain detected: failed to find matching non-primary snapshot in remote image: local_snap_id_start=7, local_snap_ns=[mirror state=non-primary, complete=1, mirror_peer_uuids=, primary_mirror_uuid=e1260ffa-0678-4a98-a264-4d7da43b071c, primary_snap_id=7, last_copied_object_number=5120, snap_seqs={7=18446744073709551614}]  
(primary) $ rbd snap ls --all img
SNAPID  NAME                                                                                       SIZE    PROTECTED  TIMESTAMP                 NAMESPACE                                                         
     8  .mirror.primary.97262092-b5ab-4c3f-99c2-6f9fc740ffaf.6c78cc55-d26b-4ce5-8df9-f26fdf154f17  20 GiB             Tue May 31 09:05:20 2022  mirror (primary peer_uuids:[59ce4f20-5cf7-4c9e-a2d9-ad769c7c8a6d])
     9  .mirror.primary.97262092-b5ab-4c3f-99c2-6f9fc740ffaf.ad039ffc-4d8a-4587-a751-99c85e1fba5c  20 GiB             Tue May 31 09:05:23 2022  mirror (primary peer_uuids:[59ce4f20-5cf7-4c9e-a2d9-ad769c7c8a6d])
    11  .mirror.primary.97262092-b5ab-4c3f-99c2-6f9fc740ffaf.3b6aa60c-170b-4791-bf1b-f605f2d787fd  20 GiB             Tue May 31 09:05:31 2022  mirror (primary peer_uuids:[59ce4f20-5cf7-4c9e-a2d9-ad769c7c8a6d])
(secondary) $ rbd snap ls --all img
SNAPID  NAME                                                                                           SIZE    PROTECTED  TIMESTAMP                 NAMESPACE                                                                       
     7  .mirror.non_primary.97262092-b5ab-4c3f-99c2-6f9fc740ffaf.7b706a77-1576-4de5-a3e7-6cf992ead19f  20 GiB             Tue May 31 09:04:53 2022  mirror (non-primary peer_uuids:[] e1260ffa-0678-4a98-a264-4d7da43b071c:7 copied)
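
The failure in the log can be read as a failed lookup: the local non-primary snapshot records primary_snap_id=7, but the remote image only has snapshots 8, 9 and 11. The following is a simplified sketch of that check, not the actual Replayer code; the function name locate_remote_start and the use of LookupError are assumptions made for illustration.

```python
def locate_remote_start(primary_snap_id, remote_snap_ids):
    """Return the remote snapshot id matching the local base, or raise."""
    if primary_snap_id not in remote_snap_ids:
        # Mirrors the wording of the log message quoted above.
        raise LookupError(
            f"failed to locate remote start snapshot: snap_id={primary_snap_id}"
        )
    return primary_snap_id


# Values taken from the log and the "rbd snap ls" output above.
remote_snap_ids = [8, 9, 11]
local_primary_snap_id = 7

try:
    locate_remote_start(local_primary_snap_id, remote_snap_ids)
    wedged = False
    message = ""
except LookupError as err:
    wedged = True
    message = str(err)

print(wedged, message)
```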

Before commit https://github.com/ceph/ceph/commit/a888bff8d00e3e496ec80e4273e01a47b67da5dc this could happen pretty much all the time because it was the second oldest snapshot that was unlinked. That commit changed it to the third oldest snapshot, which makes the race narrower but still very much possible to hit.
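
Why the position of the unlinked snapshot matters can be shown with the same kind of toy model. This is a sketch under assumptions: pick_victim and in_use_by_replayer are hypothetical helpers, and the replayer is assumed to need only its current base and the next snapshot it syncs to.

```python
def in_use_by_replayer(base, primary):
    """Snapshots the toy replayer needs: its base and the next one it syncs to."""
    i = primary.index(base)
    return set(primary[i:i + 2])


def pick_victim(primary, position):
    """Return the snapshot at a 0-based position counted from the oldest."""
    return primary[position]


primary = ["snap1", "snap2", "snap3", "snap4"]  # snap4 was just created
needed = in_use_by_replayer("snap1", primary)   # replayer syncs snap1..snap2

second_oldest = pick_victim(primary, 1)  # policy before the commit
third_oldest = pick_victim(primary, 2)   # policy after the commit

# Second oldest collides with the replayer immediately.
print(second_oldest in needed)
# Third oldest is safe only while the replayer stays on snap1..snap2.
print(third_oldest in needed)

# Once the replayer advances (snap1 removed, snap2 is the new base),
# the previously chosen third oldest snapshot is in use.
needed_later = in_use_by_replayer("snap2", primary[1:])
print(third_oldest in needed_later)
```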


Related issues 3 (0 open, 3 closed)

Copied to rbd - Backport #55844: octopus: [rbd-mirror] primary snapshot in-use by replayer can be unlinked and removed (Resolved, Ilya Dryomov)
Copied to rbd - Backport #55845: quincy: [rbd-mirror] primary snapshot in-use by replayer can be unlinked and removed (Resolved, Ilya Dryomov)
Copied to rbd - Backport #55846: pacific: [rbd-mirror] primary snapshot in-use by replayer can be unlinked and removed (Resolved, Ilya Dryomov)

#1

Updated by Ilya Dryomov almost 2 years ago

  • Description updated

#2

Updated by Ilya Dryomov almost 2 years ago

  • Description updated

#3

Updated by Ilya Dryomov almost 2 years ago

  • Description updated

#4

Updated by Ilya Dryomov almost 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Backport set to octopus,pacific,quincy
  • Pull request ID set to 46454

#5

Updated by Ilya Dryomov almost 2 years ago

  • Status changed from Fix Under Review to Pending Backport

#6

Updated by Backport Bot almost 2 years ago

  • Copied to Backport #55844: octopus: [rbd-mirror] primary snapshot in-use by replayer can be unlinked and removed added

#7

Updated by Backport Bot almost 2 years ago

  • Copied to Backport #55845: quincy: [rbd-mirror] primary snapshot in-use by replayer can be unlinked and removed added

#8

Updated by Backport Bot almost 2 years ago

  • Copied to Backport #55846: pacific: [rbd-mirror] primary snapshot in-use by replayer can be unlinked and removed added

#9

Updated by Ilya Dryomov almost 2 years ago

  • Status changed from Pending Backport to Resolved