Bug #61672

closed

rbd-mirror: non-primary images not deleted when the primary images are deleted

Added by Nithya Balachandran 11 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: pacific,quincy,reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Racing calls to InstanceReplayer::release_image() and ImageReplayer::handle_bootstrap() in the non-primary rbd-mirror daemon may prevent the non-primary image from being deleted when the primary image is deleted.

In the expected sequence, the InstanceReplayer determines that the remote image has been deleted and restarts the ImageReplayer. The restart calls bootstrap(), which determines that the peer image has been deleted.
ImageReplayer::handle_bootstrap() is then called with r=-ENOLINK; it sets m_delete_requested to true and calls shut_down(). handle_shut_down() sees that m_delete_requested is true and schedules an image delete.

ImageReplayer::stop()
  -> on_stop_journal_replay()
  -> m_stop_requested = true; m_state = STATE_STOPPING;
  -> shut_down(0)
  -> handle_shut_down()
  -> stop complete.
ImageReplayer::start()   <- restarts
  -> bootstrap()
  -> handle_bootstrap(r=-67)   // #define ENOLINK 67 /* Link has been severed */
  -> m_delete_requested = true
  -> shut_down()
  -> handle_shut_down()
  -> if m_delete_requested == true: schedules deletion

template <typename I>
void ImageReplayer<I>::handle_bootstrap(int r) {
  dout(10) << "r=" << r << dendl;
  {
    std::lock_guard locker{m_lock};
    m_bootstrap_request->put();
    m_bootstrap_request = nullptr;
  }

  if (on_start_interrupted()) {
    // When the image is not deleted, the call returns here because
    // m_stop_requested is true.
    return;
  } else if (r == -ENOMSG) {
    dout(5) << "local image is primary" << dendl;
    on_start_fail(0, "local image is primary");
    return;
  }
  ...
  } else if (r == -ENOLINK) {
    m_delete_requested = true;
    on_start_fail(0, "remote image no longer exists");
    // When the image is deleted, the call returns here.
    return;
  }

In the case where the image is not deleted, handle_bootstrap() determines that the start has been interrupted and returns without reaching the -ENOLINK code path, so m_delete_requested is never set to true. The image is thus neither moved to trash nor deleted.
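
The ordering problem described above can be reproduced in a small standalone model. The sketch below is not Ceph code: ReplayerModel and both handler functions are hypothetical names, stop_requested stands in for the on_start_interrupted() check, and the reordered variant is only one assumed way to avoid the problem, not necessarily what the linked pull request does.

```cpp
#include <cerrno>
#include <string>

// Hypothetical, simplified stand-in for the replayer state (not the Ceph class).
struct ReplayerModel {
  bool stop_requested = false;    // models m_stop_requested
  bool delete_requested = false;  // models m_delete_requested
  std::string outcome;

  // Ordering described in the report: the interruption check runs first.
  void handle_bootstrap_as_reported(int r) {
    if (stop_requested) {  // stands in for on_start_interrupted()
      outcome = "start canceled";
      return;              // -ENOLINK is never examined
    }
    if (r == -ENOLINK) {
      delete_requested = true;
      outcome = "remote image no longer exists";
      return;
    }
    outcome = "started";
  }

  // Assumed alternative: record the delete request before the check.
  void handle_bootstrap_reordered(int r) {
    if (r == -ENOLINK) {
      delete_requested = true;
    }
    if (stop_requested) {
      outcome = "start canceled";
      return;
    }
    if (r == -ENOLINK) {
      outcome = "remote image no longer exists";
      return;
    }
    outcome = "started";
  }
};
```

With stop_requested set to true and r=-ENOLINK, the first handler leaves delete_requested false, which matches the failure described in this report; the second handler sets it even though the start is canceled.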

Not easily reproducible.


Related issues 3 (0 open, 3 closed)

Copied to rbd - Backport #62111: pacific: rbd-mirror: non-primary images not deleted when the primary images are deleted (Resolved, Nithya Balachandran)
Copied to rbd - Backport #62112: quincy: rbd-mirror: non-primary images not deleted when the primary images are deleted (Resolved, Nithya Balachandran)
Copied to rbd - Backport #62113: reef: rbd-mirror: non-primary images not deleted when the primary images are deleted (Resolved, Nithya Balachandran)
Actions #1

Updated by Nithya Balachandran 11 months ago

  • Assignee set to Nithya Balachandran
Actions #2

Updated by Ilya Dryomov 11 months ago

  • Status changed from New to Fix Under Review
  • Backport set to pacific,quincy,reef
  • Pull request ID set to 52057
Actions #3

Updated by Nithya Balachandran 11 months ago

Sequence of logs when the image is deleted:

2023-03-21T06:49:18.058+0000 7f28ebd79640 10 rbd::mirror::InstanceReplayer: 0x557e76816500 remove_peer_image: global_image_id=367637f7-51c2-492e-83c8-5b908b2ba1d1, peer_mirror_uuid=61887bf9-1495-4c7d-aac5-08a561df04e9
2023-03-21T06:49:18.058+0000 7f28ebd79640 10 rbd::mirror::ImageReplayer: 0x557e78f69180 [1/367637f7-51c2-492e-83c8-5b908b2ba1d1] stop: on_finish=0x557e79b5f460, manual=0, restart=1
2023-03-21T06:49:18.058+0000 7f28ebd79640 10 rbd::mirror::ImageReplayer: 0x557e78f69180 [1/367637f7-51c2-492e-83c8-5b908b2ba1d1] stop: interrupting replay
2023-03-21T06:49:18.058+0000 7f28ebd79640 10 rbd::mirror::ImageReplayer: 0x557e78f69180 [1/367637f7-51c2-492e-83c8-5b908b2ba1d1] on_stop_journal_replay:
...

2023-03-21T06:49:18.058+0000 7f28ebd79640 10 rbd::mirror::ImageReplayer: 0x557e78f69180 [1/367637f7-51c2-492e-83c8-5b908b2ba1d1] shut_down: r=0
2023-03-21T06:49:18.058+0000 7f28ebd79640 15 rbd::mirror::ImageReplayer: 0x557e78f69180 [1/367637f7-51c2-492e-83c8-5b908b2ba1d1] shut_down: waiting for in-flight operations to complete
2023-03-21T06:49:19.136+0000 7f28ded5f640 10 rbd::mirror::ImageReplayer: 0x557e78f69180 [1/367637f7-51c2-492e-83c8-5b908b2ba1d1] handle_shut_down: stop complete
2023-03-21T06:49:19.136+0000 7f28ded5f640 10 rbd::mirror::ImageReplayer: 0x557e78f69180 [1/367637f7-51c2-492e-83c8-5b908b2ba1d1] handle_shut_down: on stop finish 0x557e79b5f460 complete, r=0
2023-03-21T06:49:19.136+0000 7f28ded5f640 10 rbd::mirror::ImageReplayer: 0x557e78f69180 [1/367637f7-51c2-492e-83c8-5b908b2ba1d1] start: on_finish=0x557e7a4708a0
2023-03-21T06:49:19.136+0000 7f28ded5f640 10 rbd::mirror::ImageReplayer: 0x557e78f69180 [1/367637f7-51c2-492e-83c8-5b908b2ba1d1] bootstrap:

...

2023-03-21T06:49:19.292+0000 7f28ded5f640 10 rbd::mirror::ImageReplayer: 0x557e78f69180 [1/367637f7-51c2-492e-83c8-5b908b2ba1d1] handle_bootstrap: r=-67
2023-03-21T06:49:19.292+0000 7f28ded5f640 10 rbd::mirror::ImageReplayer: 0x557e78f69180 [1/367637f7-51c2-492e-83c8-5b908b2ba1d1] on_start_fail: r=0, desc=remote image no longer exists

...

2023-03-21T06:49:19.292+0000 7f28eb578640 0 rbd::mirror::ImageReplayer: 0x557e78f69180 [1/367637f7-51c2-492e-83c8-5b908b2ba1d1] handle_shut_down: remote image no longer exists: scheduling deletion
2023-03-21T06:49:19.292+0000 7f28eb578640 15 rbd::mirror::ImageReplayer: 0x557e78f69180 [1/367637f7-51c2-492e-83c8-5b908b2ba1d1] unregister_admin_socket_hook:
2023-03-21T06:49:19.292+0000 7f28eb578640 5 rbd::mirror::ImageReplayer: 0x557e78f69180 [1/367637f7-51c2-492e-83c8-5b908b2ba1d1] handle_shut_down: moving image to trash
2023-03-21T06:49:19.292+0000 7f28eb578640 10 rbd::mirror::ImageDeleter: trash_move: global_image_id=367637f7-51c2-492e-83c8-5b908b2ba1d1, resync=0


In the failure scenario, the logs differ:

2023-03-21T06:49:20.092+0000 7f28eb578640 10 rbd::mirror::ImageReplayer: 0x557e78f69680 [1/4b38c2de-cf0b-4500-bae1-996aedccdfc1] stop: on_finish=0x557e7a76e320, manual=0, restart=0
2023-03-21T06:49:20.092+0000 7f28eb578640 10 rbd::mirror::ImageReplayer: 0x557e78f69680 [1/4b38c2de-cf0b-4500-bae1-996aedccdfc1] stop: canceling start
2023-03-21T06:49:20.092+0000 7f28eb578640 10 rbd::mirror::ImageReplayer: 0x557e78f69680 [1/4b38c2de-cf0b-4500-bae1-996aedccdfc1] stop: canceling bootstrap
2023-03-21T06:49:20.223+0000 7f28ded5f640 10 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x557e78125b60 handle_get_image_id: global image 4b38c2de-cf0b-4500-bae1-996aedccdfc1 not registered
2023-03-21T06:49:20.223+0000 7f28ded5f640 10 rbd::mirror::ImageReplayer: 0x557e78f69680 [1/4b38c2de-cf0b-4500-bae1-996aedccdfc1] handle_bootstrap: r=-67
2023-03-21T06:49:20.223+0000 7f28ded5f640 10 rbd::mirror::ImageReplayer: 0x557e78f69680 [1/4b38c2de-cf0b-4500-bae1-996aedccdfc1] on_start_fail: r=-125, desc=
2023-03-21T06:49:20.223+0000 7f28eb578640 10 rbd::mirror::ImageReplayer: 0x557e78f69680 [1/4b38c2de-cf0b-4500-bae1-996aedccdfc1] operator(): start canceled
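
The logs above report results as negated errno values (r=-67, r=-125). On Linux these correspond to ENOLINK and ECANCELED respectively. A small helper like the following, which is purely illustrative (describe_result is a hypothetical name, not part of rbd-mirror), can make such values easier to read while going through the logs:

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: map a negated errno result, as printed in the
// rbd-mirror logs, to a symbolic name. Only the codes that appear in
// this report are named; everything else falls back to strerror().
std::string describe_result(int r) {
  switch (-r) {
    case 0:         return "success";
    case ENOLINK:   return "ENOLINK";    // "Link has been severed"
    case ECANCELED: return "ECANCELED";  // start was canceled
    case ENOMSG:    return "ENOMSG";     // local image is primary
    default:        return std::strerror(-r);
  }
}
```

Because the switch uses the platform's own errno macros rather than hard-coded numbers, the mapping stays correct on systems where the numeric values differ from Linux.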

In summary:

InstanceReplayer::remove_peer_image() is called when a remote image is deleted.
This calls ImageReplayer::restart(), which calls ImageReplayer::bootstrap(), which determines that the remote image no longer exists.
ImageReplayer::handle_bootstrap() is then called with r=-ENOLINK and fails the start().
In the case of the failed delete, a call to InstanceReplayer::release_image() before the call to bootstrap() sets ImageReplayer::m_stop_requested to true.
handle_bootstrap() checks for this before processing the case where r == -ENOLINK, so the image is never moved to trash and hence never deleted.

Actions #4

Updated by Ilya Dryomov 9 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #5

Updated by Backport Bot 9 months ago

  • Copied to Backport #62111: pacific: rbd-mirror: non-primary images not deleted when the primary images are deleted added
Actions #6

Updated by Backport Bot 9 months ago

  • Copied to Backport #62112: quincy: rbd-mirror: non-primary images not deleted when the primary images are deleted added
Actions #7

Updated by Backport Bot 9 months ago

  • Copied to Backport #62113: reef: rbd-mirror: non-primary images not deleted when the primary images are deleted added
Actions #8

Updated by Backport Bot 9 months ago

  • Tags set to backport_processed
Actions #9

Updated by Ilya Dryomov 6 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
