Bug #59115
[rbd-mirror] unlink_peer gets stuck during blocklist (closed)
Description
When the rbd-mirror client is blocklisted, the replayer begins shutting down, but the shutdown is deferred while replayer syncs are still in progress. In one case, unlink_peer is called during this window. It finds no watcher handle present and bails out of the process, and the pending shut_down never resumes. The result is a hung blocklist recovery.
The log below shows unlink_peer getting stuck. A handle_unlink_peer call should follow shut_down, but none is logged.
2022-12-14T14:51:24.671+0000 7f0b6af50700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55cf86d6f800 unlink_peer: remote_snap_id=1657627
2022-12-14T14:58:04.357+0000 7f0b70f5c700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55cf86d6f800 shut_down:
2022-12-14T14:58:04.357+0000 7f0b70f5c700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55cf86d6f800 shut_down: shut down pending on completion of snapshot replay
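The hang described above can be sketched with a toy model. This is only an illustration under stated assumptions, not the actual rbd-mirror code: `ToyReplayer`, its fields, and `shutdown_completes` are hypothetical names, and the real calls are asynchronous. The sketch shows how a dropped completion leaves shut_down pending forever.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Toy model only; these are NOT the real rbd-mirror classes.
struct ToyReplayer {
  bool watcher_registered = true;   // false once the client is blocklisted
  bool sync_in_progress = false;    // true while unlink_peer is outstanding
  bool shut_down_complete = false;
  std::function<void()> pending_shut_down;  // deferred shutdown continuation

  // Start unlinking a peer. Without a watcher the request bails and its
  // completion is never invoked (the missing handle_unlink_peer).
  void unlink_peer() {
    sync_in_progress = true;
    if (!watcher_registered) {
      return;  // completion dropped: sync_in_progress stays true
    }
    handle_unlink_peer(0);
  }

  void handle_unlink_peer(int r) {
    (void)r;  // result code is ignored in this sketch
    sync_in_progress = false;
    if (pending_shut_down) {
      auto resume = std::move(pending_shut_down);
      pending_shut_down = nullptr;
      resume();  // resume the deferred shutdown
    }
  }

  void shut_down() {
    if (sync_in_progress) {
      // "shut down pending on completion of snapshot replay"
      pending_shut_down = [this] { finish_shut_down(); };
      return;
    }
    finish_shut_down();
  }

  void finish_shut_down() {
    assert(!sync_in_progress);
    shut_down_complete = true;
  }
};

// Hypothetical helper: run unlink_peer followed by shut_down and report
// whether shutdown finished.
bool shutdown_completes(bool watcher_registered) {
  ToyReplayer r;
  r.watcher_registered = watcher_registered;
  r.unlink_peer();
  r.shut_down();
  return r.shut_down_complete;
}
```

In this model, `shutdown_completes(true)` finishes, while `shutdown_completes(false)` leaves the shutdown parked behind a sync that never reports back, which matches the last log line above.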
Files
Updated by Christopher Hoffman about 1 year ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 50630
Updated by Christopher Hoffman about 1 year ago
I was able to reproduce this issue:
1. Introduced a sleep in "src/librbd/Operations.cc"
void send_acquire_exclusive_lock() {
+ usleep(15000000);
2. vstart with 2 clusters (site A, site B)
3. Created 10 rbd images in 1 pool on site A
(Steps 2-3 are run by reproducer_59115.sh.)
4. Wrote continuously to 10 images on site A
for i in {1..1000}; do sh payload.sh; done
5. Blocklisted rbd mirror client on site B every 90 seconds from site A.
for i in {1..1000}; do sh blocklist.sh; sleep 60s; done
Eventually I saw the above message along with "watcher not registered - delaying request" in the rbd mirror peer file on site A.
Updated by Christopher Hoffman about 1 year ago
- File reproducer_59115.sh added
- File blocklist.sh added
Updated by Ilya Dryomov about 1 year ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 1 year ago
- Copied to Backport #59369: quincy: unlink_peer gets stuck during blocklist added
Updated by Backport Bot about 1 year ago
- Copied to Backport #59370: pacific: unlink_peer gets stuck during blocklist added
Updated by Christopher Hoffman about 1 year ago
I've used the reproducer steps above to validate the patch.
Validated:
1. The code path that handles the blocklist case is hit, confirmed by the "watcher not registered - client blocklisted" log message.
2. Ran for a long duration without hitting any lockdep or other problems. rbd-mirror was continuously blocklisted and was able to recover each time.
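The two log messages quoted in this report suggest that a request arriving without a registered watcher is now handled in one of two ways. The sketch below is a guess at that decision, not the patch itself: `Action`, `Decision`, and `classify_request` are hypothetical names, and the assumption that the blocklisted case fails the request immediately (so the caller's completion still runs) is an inference from the log text.

```cpp
#include <string>

// Hypothetical names throughout; this is not the librbd API.
enum class Action { Send, Delay, Fail };

struct Decision {
  Action action;
  std::string log_message;  // empty when nothing is logged
};

// Decide what to do with a request based on the watcher and client state.
Decision classify_request(bool watcher_registered, bool client_blocklisted) {
  if (watcher_registered) {
    return {Action::Send, ""};
  }
  if (client_blocklisted) {
    // Assumption: fail at once so the caller's completion still runs.
    return {Action::Fail, "watcher not registered - client blocklisted"};
  }
  // The watcher may re-register later, so the request is held back.
  return {Action::Delay, "watcher not registered - delaying request"};
}
```

Under this reading, only the non-blocklisted case waits for the watcher to come back; a blocklisted client gets an error right away, which lets a pending shut_down proceed.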
Updated by Ilya Dryomov about 1 year ago
- Subject changed from unlink_peer gets stuck during blocklist to [rbd-mirror] unlink_peer gets stuck during blocklist
Updated by Ilya Dryomov 12 months ago
- Related to Bug #61607: hang due to exclusive lock acquisition (STATE_WAITING_FOR_LOCK) racing with blocklisting added
Updated by Backport Bot 11 months ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".