Bug #54344

closed

[rbd-mirror] disabling and shortly after re-enabling mirroring on the image can lead to split-brain

Added by Ilya Dryomov about 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version: -
% Done: 0%
Source:
Tags:
Backport: octopus,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

As a workaround for another issue, Ceph CSI is doing more or less this on a regular basis (see https://github.com/ceph/ceph-csi/pull/2656):

(primary) $ rbd mirror image disable data/test1
(primary) $ rbd mirror image enable data/test1

Unfortunately, there is a nasty race here. Below is a log from an rbd-mirror daemon that was monkey-patched to widen the race window, but the race seems to be hit pretty reliably on real clusters as soon as a couple of hundred images are involved.

2022-02-19T11:14:37.978-0500 7f0ad79e0700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55af1a471000 get_remote_image_state: 
2022-02-19T11:14:37.978-0500 7f0ad79e0700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55af1a471000 handle_get_remote_image_state: r=-2
2022-02-19T11:14:37.978-0500 7f0ad79e0700 -1 rbd::mirror::image_replayer::snapshot::Replayer: 0x55af1a471000 handle_get_remote_image_state: failed to retrieve remote snapshot image state: (2) No such file or directory                      <--------- snapshot image state is removed on the primary as part of disabling mirroring,
                                                                                                                                                                                                                                                          replayer shutdown started
2022-02-19T11:14:37.978-0500 7f0ad79e0700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55af1a471000 notify_status_updated: 
2022-02-19T11:14:37.978-0500 7f0ade1ed700 10 rbd::mirror::ImageReplayer: 0x55af195da500 [2/630f1503-b331-49ac-a497-98f28c6aa8ae] handle_replayer_notification: 
2022-02-19T11:14:37.978-0500 7f0ade1ed700 10 rbd::mirror::ImageReplayer: 0x55af195da500 [2/630f1503-b331-49ac-a497-98f28c6aa8ae] handle_replayer_notification: replay interrupted: r=-2, error=failed to retrieve remote snapshot image state
2022-02-19T11:14:37.978-0500 7f0ade1ed700 10 rbd::mirror::ImageReplayer: 0x55af195da500 [2/630f1503-b331-49ac-a497-98f28c6aa8ae] on_stop_journal_replay: 
2022-02-19T11:14:37.978-0500 7f0ade1ed700 10 rbd::mirror::ImageReplayer: 0x55af195da500 [2/630f1503-b331-49ac-a497-98f28c6aa8ae] cancel_update_mirror_image_replay_status: 
2022-02-19T11:14:37.978-0500 7f0ade1ed700 10 rbd::mirror::ImageReplayer: 0x55af195da500 [2/630f1503-b331-49ac-a497-98f28c6aa8ae] set_state_description: r=-2, desc=failed to retrieve remote snapshot image state
2022-02-19T11:14:37.978-0500 7f0ade1ed700 15 rbd::mirror::ImageReplayer: 0x55af195da500 [2/630f1503-b331-49ac-a497-98f28c6aa8ae] update_mirror_image_status: force=1, state=--
2022-02-19T11:14:37.978-0500 7f0ade1ed700 10 rbd::mirror::ImageReplayer: 0x55af195da500 [2/630f1503-b331-49ac-a497-98f28c6aa8ae] shut_down: r=0
2022-02-19T11:14:37.978-0500 7f0ade1ed700 15 rbd::mirror::ImageReplayer: 0x55af195da500 [2/630f1503-b331-49ac-a497-98f28c6aa8ae] shut_down: waiting for in-flight operations to complete
2022-02-19T11:14:37.978-0500 7f0ade1ed700 15 rbd::mirror::ImageReplayer: 0x55af195da500 [2/630f1503-b331-49ac-a497-98f28c6aa8ae] set_mirror_image_status_update: force=1, state=--
2022-02-19T11:14:37.978-0500 7f0ade1ed700 15 rbd::mirror::ImageReplayer: 0x55af195da500 [2/630f1503-b331-49ac-a497-98f28c6aa8ae] set_mirror_image_status_update: status={state=up+stopping_replay, description=failed to retrieve remote snapshot image state, last_update=0.000000]}
2022-02-19T11:14:37.978-0500 7f0ade1ed700 15 rbd::mirror::MirrorStatusUpdater 0x55af196e4900 set_mirror_image_status: global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae, mirror_image_site_status={state=up+stopping_replay, description=failed to retrieve remote snapshot image state, last_update=0.000000]}
2022-02-19T11:14:37.978-0500 7f0ade1ed700 10 rbd::mirror::MirrorStatusUpdater 0x55af196e4900 queue_update_task: deferring update due to in-flight ops
2022-02-19T11:14:37.978-0500 7f0ade1ed700 15 rbd::mirror::MirrorStatusUpdater 0x55af1a37a000 set_mirror_image_status: global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae, mirror_image_site_status={state=up+stopping_replay, description=failed to retrieve remote snapshot image state, last_update=0.000000]}
2022-02-19T11:14:37.978-0500 7f0ade1ed700 10 rbd::mirror::MirrorStatusUpdater 0x55af1a37a000 queue_update_task: deferring update due to in-flight ops
2022-02-19T11:14:37.978-0500 7f0ade1ed700 10 rbd::mirror::ImageReplayer: 0x55af195da500 [2/630f1503-b331-49ac-a497-98f28c6aa8ae] shut_down: r=0
2022-02-19T11:14:37.978-0500 7f0ade1ed700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55af1a471000 shut_down: 
2022-02-19T11:14:37.978-0500 7f0ade1ed700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55af1a471000 unregister_remote_update_watcher:
2022-02-19T11:14:37.978-0500 7f0ade1ed700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55af1a471000 handle_unregister_remote_update_watcher: r=0
2022-02-19T11:14:37.978-0500 7f0ade1ed700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55af1a471000 unregister_local_update_watcher:
2022-02-19T11:14:37.978-0500 7f0ade1ed700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55af1a471000 handle_unregister_local_update_watcher: r=0
2022-02-19T11:14:37.978-0500 7f0ade1ed700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55af1a471000 wait_for_in_flight_ops:                                                                                                           <--------- replayer shutdown gets blocked for a while
2022-02-19T11:14:37.986-0500 7f0ad71df700 10 rbd::mirror::MirrorStatusUpdater 0x55af196e4900 handle_update_task:
2022-02-19T11:14:37.986-0500 7f0ad71df700 10 rbd::mirror::MirrorStatusUpdater 0x55af196e4900 queue_update_task:
2022-02-19T11:14:37.986-0500 7f0add9ec700 10 rbd::mirror::MirrorStatusUpdater 0x55af196e4900 update_task:
2022-02-19T11:14:37.990-0500 7f0ad11d3700 10 rbd::mirror::MirrorStatusUpdater 0x55af1a37a000 handle_update_task:
2022-02-19T11:14:37.990-0500 7f0ad11d3700 10 rbd::mirror::MirrorStatusUpdater 0x55af1a37a000 queue_update_task:
2022-02-19T11:14:37.990-0500 7f0ade1ed700 10 rbd::mirror::MirrorStatusUpdater 0x55af1a37a000 update_task:
2022-02-19T11:14:37.994-0500 7f0ad79e0700 10 rbd::mirror::MirrorStatusUpdater 0x55af196e4900 handle_update_task:
2022-02-19T11:14:37.994-0500 7f0ad09d2700 10 rbd::mirror::MirrorStatusUpdater 0x55af1a37a000 handle_update_task:
2022-02-19T11:14:39.058-0500 7f0ad11d3700 20 rbd::mirror::MirrorStatusWatcher: 0x55af1a37c000 handle_notify:
2022-02-19T11:14:39.058-0500 7f0ad11d3700 10 rbd::mirror::PoolWatcher: 0x55af1a38e1e0 handle_image_updated: image_id=1029486fcfd6, global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae, enabled=0                                             <--------- "mirroring is disabled, remove the image" notification from the primary
2022-02-19T11:14:39.058-0500 7f0ad11d3700 20 rbd::mirror::PoolWatcher: 0x55af1a38e1e0 schedule_listener:
2022-02-19T11:14:39.058-0500 7f0add9ec700 10 rbd::mirror::PoolWatcher: 0x55af1a38e1e0 notify_listener:
2022-02-19T11:14:39.058-0500 7f0add9ec700 20 rbd::mirror::PoolWatcher: 0x55af1a38e1e0 notify_listener: image_id=global id=630f1503-b331-49ac-a497-98f28c6aa8ae, id=1029486fcfd6
2022-02-19T11:14:39.058-0500 7f0add9ec700 10 rbd::mirror::NamespaceReplayer: 0x55af1a36e000 handle_update: mirror_uuid=59b2a772-019c-4bbc-9fbc-d72c8e317f72, added_count=0, removed_count=1
2022-02-19T11:14:39.058-0500 7f0add9ec700 20 rbd::mirror::ServiceDaemon: 0x55af18a2ea20 add_or_update_attribute: pool_id=2, key=image_local_count, value=0
2022-02-19T11:14:39.058-0500 7f0add9ec700 20 rbd::mirror::ServiceDaemon: 0x55af18a2ea20 add_or_update_attribute: pool_id=2, key=image_remote_count, value=0
2022-02-19T11:14:39.058-0500 7f0add9ec700  5 rbd::mirror::ImageMap: 0x55af17d6bb00 update_images: peer_uuid=59b2a772-019c-4bbc-9fbc-d72c8e317f72, added_count=0, removed_count=1
2022-02-19T11:14:39.058-0500 7f0add9ec700  5 rbd::mirror::ImageMap: 0x55af17d6bb00 update_images_removed: peer_uuid=59b2a772-019c-4bbc-9fbc-d72c8e317f72, global_image_ids=[630f1503-b331-49ac-a497-98f28c6aa8ae]
2022-02-19T11:14:39.058-0500 7f0add9ec700 20 rbd::mirror::image_map::Policy: 0x55af17cea540 lookup: global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae
2022-02-19T11:14:39.058-0500 7f0add9ec700  5 rbd::mirror::image_map::Policy: 0x55af17cea540 remove_image: global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae
2022-02-19T11:14:39.058-0500 7f0add9ec700 20 rbd::mirror::ImageMap: 0x55af17d6bb00 schedule_action: global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae
2022-02-19T11:14:39.058-0500 7f0add9ec700  5 rbd::mirror::ImageMap: 0x55af17d6bb00 notify_listener_remove_images: peer_uuid=59b2a772-019c-4bbc-9fbc-d72c8e317f72, remove=[{global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae, instance_id=4741}]
2022-02-19T11:14:39.058-0500 7f0add9ec700  5 rbd::mirror::NamespaceReplayer: 0x55af1a36e000 handle_remove_image: mirror_uuid=59b2a772-019c-4bbc-9fbc-d72c8e317f72, global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae, instance_id=4741
2022-02-19T11:14:39.058-0500 7f0add9ec700 10 rbd::mirror::InstanceWatcher: 0x55af17d12000 notify_peer_image_removed: instance_id=4741, global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae, peer_mirror_uuid=59b2a772-019c-4bbc-9fbc-d72c8e317f72
2022-02-19T11:14:39.058-0500 7f0add9ec700 10 rbd::mirror::InstanceWatcher: C_NotifyInstanceRequest: 0x55af1a39a900 C_NotifyInstanceRequest: instance_watcher=0x55af17d12000, instance_id=4741, request_id=2
2022-02-19T11:14:39.058-0500 7f0add9ec700 10 rbd::mirror::InstanceWatcher: C_NotifyInstanceRequest: 0x55af1a39a900 send
2022-02-19T11:14:39.058-0500 7f0add9ec700 10 rbd::mirror::InstanceWatcher: C_NotifyInstanceRequest: 0x55af1a39a900 send: sending to 4741
2022-02-19T11:14:39.058-0500 7f0add9ec700 20 rbd::mirror::ImageMap: 0x55af17d6bb00 schedule_update_task: scheduling image check update (0x55af17cc1570) after 1 second(s)
2022-02-19T11:14:39.058-0500 7f0ad71df700 10 rbd::mirror::InstanceWatcher: 0x55af17d12000 handle_notify: notify_id=55834574853, handle=94210547452160, notifier_id=4741
2022-02-19T11:14:39.058-0500 7f0ad71df700 10 rbd::mirror::InstanceWatcher: 0x55af17d12000 handle_payload: remove_peer_image: instance_id=4741, request_id=2
2022-02-19T11:14:39.058-0500 7f0ad71df700 10 rbd::mirror::InstanceWatcher: 0x55af17d12000 prepare_request: instance_id=4741, request_id=2
2022-02-19T11:14:39.058-0500 7f0ad71df700 10 rbd::mirror::InstanceWatcher: 0x55af17d12000 handle_peer_image_removed: global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae, peer_mirror_uuid=59b2a772-019c-4bbc-9fbc-d72c8e317f72
2022-02-19T11:14:39.058-0500 7f0ad71df700  5 librbd::Watcher: 0x55af17d12000 notifications_blocked: blocked=0
2022-02-19T11:14:39.058-0500 7f0ade1ed700 10 rbd::mirror::InstanceReplayer: 0x55af18b663c0 remove_peer_image: global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae, peer_mirror_uuid=59b2a772-019c-4bbc-9fbc-d72c8e317f72                      <--------- in order to remove the image remove_peer_image() restarts the replayer
                                                                                                                                                                                                                                                          upon bootstrap failure the starting replayer is supposed to trash the image
2022-02-19T11:14:39.058-0500 7f0ade1ed700 10 rbd::mirror::ImageReplayer: 0x55af195da500 [2/630f1503-b331-49ac-a497-98f28c6aa8ae] stop: on_finish=0x55af1a463780, manual=0, restart=1
2022-02-19T11:14:39.058-0500 7f0ade1ed700 20 rbd::mirror::ImageReplayer: 0x55af195da500 [2/630f1503-b331-49ac-a497-98f28c6aa8ae] stop: not running                                                                                             <--------- BUG: stop _is_ running -- replayer shutdown still blocked!
2022-02-19T11:14:39.058-0500 7f0ade1ed700 10 rbd::mirror::ImageReplayer: 0x55af195da500 [2/630f1503-b331-49ac-a497-98f28c6aa8ae] start: on_finish=0x55af1a4637a0
2022-02-19T11:14:39.058-0500 7f0ade1ed700 -1 rbd::mirror::ImageReplayer: 0x55af195da500 [2/630f1503-b331-49ac-a497-98f28c6aa8ae] start: already running                                                                                        <--------- start fails, restart initiated by remove_peer_image() is effectively canceled
2022-02-19T11:14:39.058-0500 7f0ade1ed700 10 rbd::mirror::InstanceWatcher: 0x55af17d12000 complete_request: instance_id=4741, request_id=2
2022-02-19T11:14:39.058-0500 7f0add9ec700 10 rbd::mirror::InstanceWatcher: C_NotifyInstanceRequest: 0x55af1a39a900 finish: r=0
2022-02-19T11:14:39.058-0500 7f0add9ec700  5 rbd::mirror::ImageMap: 0x55af17d6bb00 handle_peer_ack_remove: global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae
2022-02-19T11:14:40.058-0500 7f0ada9e6700 20 rbd::mirror::ServiceDaemon: 0x55af18a2ea20 update_status:
2022-02-19T11:14:40.058-0500 7f0ada9e6700 20 rbd::mirror::ImageMap: 0x55af17d6bb00 process_updates:
2022-02-19T11:14:40.058-0500 7f0ada9e6700  5 rbd::mirror::image_map::Policy: 0x55af17cea540 start_action: global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae, state=DISSOCIATING, action_type=RELEASE
2022-02-19T11:14:40.058-0500 7f0ada9e6700 20 rbd::mirror::image_map::Policy: 0x55af17cea540 lookup: global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae
2022-02-19T11:14:40.058-0500 7f0ada9e6700 15 rbd::mirror::ImageMap: 0x55af17d6bb00 process_updates: global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae, action=RELEASE, instance=4741
2022-02-19T11:14:40.058-0500 7f0ada9e6700  5 rbd::mirror::ImageMap: 0x55af17d6bb00 notify_listener_acquire_release_images: acquire=[], release=[{global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae, instance_id=4741}]
2022-02-19T11:14:40.058-0500 7f0ada9e6700  5 rbd::mirror::NamespaceReplayer: 0x55af1a36e000 handle_release_image: global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae, instance_id=4741
2022-02-19T11:14:40.058-0500 7f0ada9e6700 10 rbd::mirror::InstanceWatcher: 0x55af17d12000 notify_image_release: instance_id=4741, global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae
2022-02-19T11:14:40.058-0500 7f0ada9e6700 10 rbd::mirror::InstanceWatcher: C_NotifyInstanceRequest: 0x55af1a39a9c0 C_NotifyInstanceRequest: instance_watcher=0x55af17d12000, instance_id=4741, request_id=3
2022-02-19T11:14:40.058-0500 7f0ada9e6700 10 rbd::mirror::InstanceWatcher: C_NotifyInstanceRequest: 0x55af1a39a9c0 send
2022-02-19T11:14:40.058-0500 7f0ada9e6700 10 rbd::mirror::InstanceWatcher: C_NotifyInstanceRequest: 0x55af1a39a9c0 send: sending to 4741
2022-02-19T11:14:40.058-0500 7f0ad71df700  5 librbd::Watcher: 0x55af17d12000 notifications_blocked: blocked=0
2022-02-19T11:14:40.058-0500 7f0ad71df700 10 rbd::mirror::InstanceWatcher: 0x55af17d12000 handle_notify: notify_id=55834574854, handle=94210547452160, notifier_id=4741
2022-02-19T11:14:40.058-0500 7f0ad71df700 10 rbd::mirror::InstanceWatcher: 0x55af17d12000 handle_payload: image_release: instance_id=4741, request_id=3
2022-02-19T11:14:40.058-0500 7f0ad71df700 10 rbd::mirror::InstanceWatcher: 0x55af17d12000 prepare_request: instance_id=4741, request_id=3
2022-02-19T11:14:40.058-0500 7f0ad71df700 10 rbd::mirror::InstanceWatcher: 0x55af17d12000 handle_image_release: global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae
2022-02-19T11:14:40.058-0500 7f0ade1ed700 10 rbd::mirror::InstanceReplayer: 0x55af18b663c0 release_image: global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae                                                                                  <--------- replayer is erased from the map of known replayers
                                                                                                                                                                                                                                                           this means that it wouldn't be restarted by rbd_mirror_image_state_check_interval handler either
2022-02-19T11:14:40.058-0500 7f0ade1ed700 10 rbd::mirror::InstanceReplayer: 0x55af18b663c0 stop_image_replayer: 0x55af195da500 global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae, on_finish=0x55af1a462d80
2022-02-19T11:14:40.058-0500 7f0ade1ed700 10 rbd::mirror::InstanceReplayer: 0x55af18b663c0 stop_image_replayer: scheduling image replayer 0x55af195da500 stop after 1 sec (task 0x55af1a463d80)
2022-02-19T11:14:40.862-0500 7f0ae1a0c680 20 rbd::mirror::Mirror: 0x55af17cde280 run_cache_manager: tune memory
2022-02-19T11:14:40.894-0500 7f0ada9e6700 10 rbd::mirror::LeaderWatcher: 0x55af1a34d200 execute_timer_task: 
2022-02-19T11:14:40.894-0500 7f0ada9e6700 10 rbd::mirror::LeaderWatcher: 0x55af1a34d200 is_leader: 1
2022-02-19T11:14:40.894-0500 7f0ada9e6700 10 rbd::mirror::LeaderWatcher: 0x55af1a34d200 notify_heartbeat: 
2022-02-19T11:14:40.894-0500 7f0ada9e6700 10 rbd::mirror::LeaderWatcher: 0x55af1a34d200 is_leader: 1
2022-02-19T11:14:40.898-0500 7f0ad79e0700  5 librbd::Watcher: 0x55af1a34d200 notifications_blocked: blocked=0
2022-02-19T11:14:40.898-0500 7f0ad79e0700 10 rbd::mirror::LeaderWatcher: 0x55af1a34d200 handle_notify: notify_id=55834574855, handle=94210547454464, notifier_id=4741
2022-02-19T11:14:40.898-0500 7f0ad79e0700 10 rbd::mirror::LeaderWatcher: 0x55af1a34d200 handle_notify: our own notification, ignoring
2022-02-19T11:14:40.898-0500 7f0add9ec700 10 rbd::mirror::LeaderWatcher: 0x55af1a34d200 handle_notify_heartbeat: r=0
2022-02-19T11:14:40.898-0500 7f0add9ec700 10 rbd::mirror::LeaderWatcher: 0x55af1a34d200 is_leader: 1
2022-02-19T11:14:40.898-0500 7f0add9ec700 10 rbd::mirror::LeaderWatcher: 0x55af1a34d200 handle_notify_heartbeat: 1 acks received, 0 timed out
2022-02-19T11:14:40.898-0500 7f0add9ec700 10 rbd::mirror::Instances: 0x55af1a37a120 acked: instance_ids=[4741]
2022-02-19T11:14:40.898-0500 7f0add9ec700 10 rbd::mirror::LeaderWatcher: 0x55af1a34d200 schedule_timer_task: scheduling heartbeat after 5 sec (task 0x55af1a328ed0)
2022-02-19T11:14:40.898-0500 7f0add9ec700  5 rbd::mirror::Instances: 0x55af1a37a120 handle_acked: instance_ids=[4741]
2022-02-19T11:14:41.058-0500 7f0ade1ed700 10 rbd::mirror::InstanceReplayer: 0x55af18b663c0 stop_image_replayer: 0x55af195da500 global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae, on_finish=0x55af1a462d80
2022-02-19T11:14:41.058-0500 7f0ade1ed700 10 rbd::mirror::InstanceReplayer: 0x55af18b663c0 stop_image_replayer: scheduling image replayer 0x55af195da500 stop after 1 sec (task 0x55af1a463dc0)
2022-02-19T11:14:42.058-0500 7f0add9ec700 10 rbd::mirror::InstanceReplayer: 0x55af18b663c0 stop_image_replayer: 0x55af195da500 global_image_id=630f1503-b331-49ac-a497-98f28c6aa8ae, on_finish=0x55af1a462d80
2022-02-19T11:14:42.058-0500 7f0add9ec700 10 rbd::mirror::InstanceReplayer: 0x55af18b663c0 stop_image_replayer: scheduling image replayer 0x55af195da500 stop after 1 sec (task 0x55af1a4638e0)
2022-02-19T11:14:42.982-0500 7f0ada9e6700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55af1a471000 handle_wait_for_in_flight_ops: r=0                                                                                                <--------- replayer shutdown is unblocked and soon it gets destroyed, the image is leaked
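
The sequence in the log can be condensed into a toy model. This is only a sketch, not rbd-mirror code: the `ToyReplayer` class, its states, and its method names are hypothetical and chosen to mirror the log messages. It assumes the race comes down to stop() treating a replayer whose shutdown is still in flight as "not running" while start() treats the same replayer as "already running".

```python
from enum import Enum, auto


class State(Enum):
    REPLAYING = auto()
    STOPPING = auto()  # shutdown started, still waiting for in-flight ops
    STOPPED = auto()


class ToyReplayer:
    """Hypothetical stand-in for an image replayer; not the real class."""

    def __init__(self):
        self.state = State.REPLAYING
        self.log = []

    def interrupt(self):
        # Replay fails (r=-2); shutdown begins but blocks on in-flight ops.
        self.state = State.STOPPING

    def stop(self):
        # Modeled bug: only REPLAYING counts as running, so a shutdown that
        # is still in flight is reported as "not running".
        if self.state is not State.REPLAYING:
            self.log.append("stop: not running")
            return False
        self.state = State.STOPPING
        return True

    def start(self):
        # start() is stricter: anything but STOPPED is "already running".
        if self.state is not State.STOPPED:
            self.log.append("start: already running")
            return False
        self.state = State.REPLAYING
        return True

    def finish_shutdown(self):
        self.state = State.STOPPED


def remove_peer_image(replayer):
    """Restart the replayer so that a bootstrap failure can trash the image."""
    replayer.stop()
    return replayer.start()  # True only if the restart actually happened


replayers = {"630f1503": ToyReplayer()}
r = replayers["630f1503"]
r.interrupt()                          # 11:14:37 shutdown starts and blocks
restarted = remove_peer_image(r)       # 11:14:39 restart is effectively canceled
released = replayers.pop("630f1503")   # 11:14:40 erased from the replayer map
released.finish_shutdown()             # 11:14:42 shutdown completes
print(restarted, released.log)
```

In this model nothing is left to retry the removal once the replayer has been erased from the map, which corresponds to the leaked image in the log.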

The stale image then causes split-brain errors if mirroring is re-enabled because rbd-mirror can't create a new image with the same name:

(primary) $ rbd mirror image status data/test1
test1:
  global_id:   bf69b7b7-3c22-4c9c-9767-e18930008205
  state:       up+stopped
  description: local image is primary
  service:     admin on rbd-mirror-test
  last_update: 2022-02-21 07:45:25
  peer_sites:
    name: site-b
    state: up+error
    description: split-brain detected
    last_update: 2022-02-21 07:45:26
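
When many images are involved, output like the above can be scanned for affected peers. The following is a rough sketch that assumes the plain-text layout shown here; the helper names `peer_states` and `split_brain_peers` are made up for illustration and are not part of any Ceph tool.

```python
SAMPLE = """\
test1:
  global_id:   bf69b7b7-3c22-4c9c-9767-e18930008205
  state:       up+stopped
  description: local image is primary
  service:     admin on rbd-mirror-test
  last_update: 2022-02-21 07:45:25
  peer_sites:
    name: site-b
    state: up+error
    description: split-brain detected
    last_update: 2022-02-21 07:45:26
"""


def peer_states(status_text):
    """Return {peer name: {field: value}} for the peer_sites section."""
    peers = {}
    in_peers = False
    current = None
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped == "peer_sites:":
            in_peers = True
            continue
        if not in_peers or ":" not in stripped:
            continue
        # Split on the first colon only, so timestamps stay intact.
        key, _, value = stripped.partition(":")
        value = value.strip()
        if key == "name":
            current = value
            peers[current] = {}
        elif current is not None:
            peers[current][key] = value
    return peers


def split_brain_peers(status_text):
    """Names of peers whose description mentions split-brain."""
    return [
        name
        for name, info in peer_states(status_text).items()
        if "split-brain" in info.get("description", "")
    ]


print(split_brain_peers(SAMPLE))
```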

As a workaround, restarting rbd-mirror would trigger an image re-scan and the offending stale image would be removed.


Related issues: 3 (0 open, 3 closed)

Related to rbd - Bug #55317: [test] add_event_after() expects an externally-provided mutex to be held (Resolved, Ilya Dryomov)
Copied to rbd - Backport #54377: octopus: [rbd-mirror] disabling and shortly after re-enabling mirroring on the image can lead to split-brain (Resolved, Ponnuvel P)
Copied to rbd - Backport #54378: pacific: [rbd-mirror] disabling and shortly after re-enabling mirroring on the image can lead to split-brain (Resolved, Deepika Upadhyay)

#1

Updated by Ilya Dryomov about 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Backport set to octopus,pacific
  • Pull request ID set to 45106

#2

Updated by Ilya Dryomov about 2 years ago

  • Status changed from Fix Under Review to Pending Backport

#3

Updated by Backport Bot about 2 years ago

  • Copied to Backport #54377: octopus: [rbd-mirror] disabling and shortly after re-enabling mirroring on the image can lead to split-brain added

#4

Updated by Backport Bot about 2 years ago

  • Copied to Backport #54378: pacific: [rbd-mirror] disabling and shortly after re-enabling mirroring on the image can lead to split-brain added

#5

Updated by Ilya Dryomov about 2 years ago

  • Related to Bug #55317: [test] add_event_after() expects an externally-provided mutex to be held added

#6

Updated by Ilya Dryomov almost 2 years ago

  • Status changed from Pending Backport to Resolved