Bug #59115

closed

[rbd-mirror] unlink_peer gets stuck during blocklist

Added by Christopher Hoffman about 1 year ago. Updated 11 months ago.

Status:
Resolved
Priority:
Normal
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When the rbd-mirror client is blocklisted and the replayer is subsequently shut down, the shutdown is held up while replayer syncs are still in progress. In one case, unlink_peer is called during this window. The unlink_peer request finds no watcher handle present and bails out without completing, so the shut_down process never resumes. The result is a hung blocklist recovery.

The log below shows unlink_peer getting stuck. A handle_unlink_peer call should appear after shut_down, but it never does.

2022-12-14T14:51:24.671+0000 7f0b6af50700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55cf86d6f800 unlink_peer: remote_snap_id=1657627
2022-12-14T14:58:04.357+0000 7f0b70f5c700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55cf86d6f800 shut_down: 
2022-12-14T14:58:04.357+0000 7f0b70f5c700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55cf86d6f800 shut_down: shut down pending on completion of snapshot replay
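
A stuck request of this kind can be spotted by pairing each unlink_peer line with a later handle_unlink_peer line for the same Replayer address. The following is a minimal sketch, not one of the ticket's attachments: the function name find_stuck_unlink_peer is hypothetical, and it assumes handle_unlink_peer is logged with the same prefix as the lines above.

```shell
# Hypothetical helper: list Replayer instances that logged unlink_peer
# but never logged a matching handle_unlink_peer afterwards.
# Assumption: handle_unlink_peer uses the same
# "...::Replayer: <address> <function>:" prefix as the lines in this ticket.
find_stuck_unlink_peer() {
    awk '
        /rbd::mirror::image_replayer::snapshot::Replayer:/ {
            addr = ""; fn = ""
            # The address follows the "Replayer:" field, then the function name.
            for (i = 1; i < NF; i++) {
                if ($i ~ /Replayer:$/) {
                    addr = $(i + 1)
                    fn = $(i + 2)
                    break
                }
            }
            if (fn == "unlink_peer:") {
                pending[addr] = $1          # remember the timestamp
            } else if (fn == "handle_unlink_peer:") {
                delete pending[addr]        # request completed
            }
        }
        END {
            for (addr in pending) {
                printf "%s unlink_peer pending since %s\n", addr, pending[addr]
            }
        }
    ' "$1"
}
```

Called with the path to an rbd-mirror log file, it prints one line per Replayer whose unlink_peer never completed, and prints nothing when every request was answered.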

Files

reproducer_59115.sh (1.91 KB), Christopher Hoffman, 04/05/2023 08:36 PM
blocklist.sh (229 Bytes), Christopher Hoffman, 04/05/2023 08:38 PM

Related issues 3 (0 open, 3 closed)

Related to rbd - Bug #61607: hang due to exclusive lock acquisition (STATE_WAITING_FOR_LOCK) racing with blocklisting (Resolved, Ramana Raja)
Copied to rbd - Backport #59369: quincy: unlink_peer gets stuck during blocklist (Resolved, Christopher Hoffman)
Copied to rbd - Backport #59370: pacific: unlink_peer gets stuck during blocklist (Resolved, Christopher Hoffman)
Actions #1

Updated by Christopher Hoffman about 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 50630
Actions #2

Updated by Christopher Hoffman about 1 year ago

I was able to reproduce this issue:

1. Introduced a 15-second sleep in "src/librbd/Operations.cc":

   void send_acquire_exclusive_lock() {
+    usleep(15000000);

2. Started 2 clusters with vstart (site A, site B)
3. Created 10 rbd images in 1 pool on site A

Steps 2-3 are done by running reproducer_59115.sh.

4. Wrote continuously to 10 images on site A

for i in {1..1000}; do sh payload.sh; done

5. Blocklisted rbd mirror client on site B every 90 seconds from site A.

for i in {1..1000}; do sh blocklist.sh; sleep 60s; done

Eventually I saw the above message along with "watcher not registered - delaying request" in the rbd mirror peer file on site A.

Actions #3

Updated by Ilya Dryomov about 1 year ago

  • Backport set to pacific,quincy
Actions #5

Updated by Ilya Dryomov about 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #6

Updated by Backport Bot about 1 year ago

  • Copied to Backport #59369: quincy: unlink_peer gets stuck during blocklist added
Actions #7

Updated by Backport Bot about 1 year ago

  • Copied to Backport #59370: pacific: unlink_peer gets stuck during blocklist added
Actions #8

Updated by Backport Bot about 1 year ago

  • Tags set to backport_processed
Actions #9

Updated by Christopher Hoffman about 1 year ago

I've used the reproducer steps above to validate the patch.

Validated:
1. The code path for handling the blocklist case is hit, confirmed by "watcher not registered - client blocklisted" in the log.
2. Ran for a long duration without running into lockdep or any other problem. rbd-mirror was continuously blocklisted and was able to recover each time.
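
The log check in item 1 can be scripted. This is a rough sketch that only counts the two watcher messages quoted in this ticket ("delaying request" from the reproduction, "client blocklisted" from the patched run). The function name count_watcher_messages is hypothetical and the log file path is whatever your setup uses.

```shell
# Hypothetical helper: count the two "watcher not registered" messages
# quoted in this ticket in a given log file.
count_watcher_messages() {
    log_file=$1
    # grep -c exits non-zero when there are no matches but still prints 0,
    # so "|| true" keeps the helper usable under "set -e".
    delayed=$(grep -c 'watcher not registered - delaying request' "$log_file" || true)
    blocklisted=$(grep -c 'watcher not registered - client blocklisted' "$log_file" || true)
    echo "delaying_request=$delayed client_blocklisted=$blocklisted"
}
```

A non-zero client_blocklisted count indicates the blocklist handling path was exercised. The counts alone do not prove recovery, so they complement rather than replace the long-duration run in item 2.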

Actions #10

Updated by Ilya Dryomov about 1 year ago

  • Subject changed from unlink_peer gets stuck during blocklist to [rbd-mirror] unlink_peer gets stuck during blocklist
Actions #11

Updated by Ilya Dryomov 12 months ago

  • Related to Bug #61607: hang due to exclusive lock acquisition (STATE_WAITING_FOR_LOCK) racing with blocklisting added
Actions #12

Updated by Backport Bot 11 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
