Steps to reproduce:
1. Build the code with https://github.com/ceph/ceph/pull/51540
2. Set up mirroring on two clusters and enable mirroring on an image.
3. Kill site-a, which is the current primary, and also kill the site-a mirror daemon.
4. Force-promote the image on site-b.
I was able to hit the two scenarios below (after a good 50 retries):
Scenario 1: The image is stuck in {state=up+stopped, description=force promoted} forever!
void ImageReplayer<I>::shut_down(int r)
[...]
void ImageReplayer<I>::handle_shut_down(int r) --> eventually sets m_state = STATE_STOPPED;
Please note that `ImageReplayer<I>::handle_shut_down` conditionally re-invokes itself a couple of times after the initial invocation.
The conditions that re-invoke `ImageReplayer<I>::handle_shut_down` from within itself in this use case include:
void ImageReplayer<I>::handle_shut_down(int r) {
  [...]
  if (!m_in_flight_op_tracker.empty()) {
    dout(15) << "waiting for in-flight operations to complete" << dendl;
    m_in_flight_op_tracker.wait_for_ops(new LambdaContext([this, r](int) {
        handle_shut_down(r);
      }));
    return;  // deferred: m_state is not yet STATE_STOPPED
  }
  if (!m_status_removed) {
    auto ctx = new LambdaContext([this, r](int) {
        m_status_removed = true;
        handle_shut_down(r);
      });
    remove_image_status(m_delete_in_progress, ctx);
    return;  // deferred again: m_state is still not STATE_STOPPED
  }
  [...]
  dout(10) << "stop complete" << dendl;
  [...]
  {
    [...]
    m_state = STATE_STOPPED;
  }
}
So it's no surprise that m_state is not set to STATE_STOPPED after the initial couple of invocations.
While `ImageReplayer<I>::handle_shut_down` has still not progressed to "stop complete" (i.e. m_state is not yet STATE_STOPPED):
InstanceReplayer<I>::queue_start_image_replayers()
-> InstanceReplayer<I>::start_image_replayers
-> InstanceReplayer<I>::start_image_replayer
`start_image_replayer` returns from here, i.e. it doesn't start an image replayer:
void InstanceReplayer<I>::start_image_replayer(...) {
  [...]
  if (!image_replayer->is_stopped()) {
    return;  // the replayer is not yet STATE_STOPPED, so no restart
  }
  [...]
}
Note: InstanceReplayer<I>::queue_start_image_replayers() is invoked every 5 seconds, based on `rbd_mirror_image_state_check_interval`.
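To make that window concrete, here is a minimal standalone sketch (my simplification, NOT Ceph code; all names are made up) of the interaction: an asynchronous shutdown that only flips the state at the very end, and a periodic starter that bails out while the state is not yet STOPPED:

// Minimal standalone sketch of the race (NOT Ceph code; names made up).
// Build: g++ -std=c++17 -pthread race_sketch.cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

using namespace std::chrono_literals;

enum class State { REPLAYING, STOPPING, STOPPED };

struct ImageReplayerSim {
  std::atomic<State> m_state{State::REPLAYING};

  bool is_stopped() const { return m_state == State::STOPPED; }

  // Mimics ImageReplayer::handle_shut_down(): completion is deferred
  // through async callbacks (wait_for_ops, remove_image_status), so
  // m_state only becomes STOPPED at the very end.
  void shut_down() {
    m_state = State::STOPPING;
    std::thread([this] {
      std::this_thread::sleep_for(1200ms);  // "waiting for in-flight operations"
      std::this_thread::sleep_for(300ms);   // remove_image_status round trip
      m_state = State::STOPPED;             // "stop complete"
    }).detach();
  }
};

int main() {
  ImageReplayerSim replayer;
  replayer.shut_down();

  // Mimics queue_start_image_replayers(): in rbd-mirror this fires every
  // 5 seconds (rbd_mirror_image_state_check_interval); compressed here.
  for (int tick = 0; tick < 5; ++tick) {
    if (!replayer.is_stopped()) {
      std::cout << "tick " << tick << ": not stopped -> bail, no restart\n";
    } else {
      std::cout << "tick " << tick << ": stopped -> would restart replayer\n";
    }
    std::this_thread::sleep_for(500ms);
  }
}

Until the detached shutdown thread finishes, every tick prints "bail, no restart" -- which is exactly the {state=up+stopped, description=force promoted} window described above.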
While `ImageReplayer<I>::handle_shut_down` is still either
1) "waiting for in-flight operations to complete", or
2) in remove_image_status but not yet at "stop complete",
the remote pool poller keeps running:
void RemotePoolPoller<I>::schedule_task(int r) --> polls every POLL_INTERVAL_SECONDS = 30 seconds
-> void RemotePoolPoller<I>::handle_task()
-> void RemotePoolPoller<I>::get_mirror_uuid()
-> void RemotePoolPoller<I>::handle_get_mirror_uuid(int r)
-> void RemotePoolPoller<I>::mirror_peer_ping()
-> void RemotePoolPoller<I>::handle_mirror_peer_ping(int r)
-> void RemotePoolPoller<I>::mirror_peer_list()
-> void RemotePoolPoller<I>::handle_mirror_peer_list(int r)
-> void RemotePoolPoller<I>::notify_listener() --> m_listener.handle_updated()
struct PoolReplayer<I>::RemotePoolPollerListener
    : public remote_pool_poller::Listener {
  [...]
  void handle_updated(const RemotePoolMeta& remote_pool_meta) override {
    m_pool_replayer->handle_remote_pool_meta_updated(remote_pool_meta);
  }
};
-> void PoolReplayer<I>::handle_remote_pool_meta_updated() {
     [...]
     derr << "remote pool metadata updated unexpectedly" << dendl;
     m_stopping = true;
     m_cond.notify_all();
   }
Note that m_stopping is now set to true...
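For context, m_cond here is (as far as I can tell) what the pool replayer's run loop sleeps on; setting m_stopping and notifying wakes that loop so it can exit. A condensed sketch of the hand-off (simplified, NOT Ceph code):

// Condensed sketch of the stop-flag/condvar hand-off (NOT Ceph code).
// Build: g++ -std=c++17 -pthread stop_flag_sketch.cpp
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;

struct PoolReplayerSim {
  std::mutex m_lock;
  std::condition_variable m_cond;
  bool m_stopping = false;

  // The run loop sleeps on m_cond and exits once m_stopping is set,
  // which is what later makes the replayer eligible for a restart.
  void run() {
    std::unique_lock<std::mutex> locker{m_lock};
    while (!m_stopping) {
      m_cond.wait_for(locker, 200ms);
    }
    std::cout << "run loop exiting; replayer flagged as stopping\n";
  }

  // What handle_remote_pool_meta_updated() effectively does.
  void handle_remote_pool_meta_updated() {
    std::lock_guard<std::mutex> locker{m_lock};
    std::cerr << "remote pool metadata updated unexpectedly\n";
    m_stopping = true;
    m_cond.notify_all();
  }
};

int main() {
  PoolReplayerSim replayer;
  std::thread runner{&PoolReplayerSim::run, &replayer};
  std::this_thread::sleep_for(500ms);
  replayer.handle_remote_pool_meta_updated();  // poller listener fires
  runner.join();
}

The point is that the stop flag is set from the poller's callback, asynchronously to everything the image replayer is still doing.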
Remember, `ImageReplayer<I>::handle_shut_down` is still at 1) "waiting for in-flight operations to complete" or 2) remove_image_status, and has not yet reached "stop complete", i.e. m_state in the image replayer is still not STATE_STOPPED.
So any call to queue_start_image_replayers() -> start_image_replayers() -> start_image_replayer() still returns early, i.e. it doesn't start an image replayer.
On the next update_pool_replayers(), invoked every 30 seconds based on `rbd_mirror_pool_replayers_refresh_interval`:
void Mirror::update_pool_replayers()
-> "restarting failed pool replayer for"
-> pool_replayer->shut_down();
-> pool_replayer->init(site_name);
Meanwhile, the pool replayer shutdown leads to the shutdown of all the other components:
void PoolReplayer<I>::shut_down()
-> void LeaderWatcher<I>::shut_down(Context *on_finish)
-> void LeaderWatcher<I>::cancel_timer_task()
-> void LeaderWatcher<I>::shut_down_leader_lock()
-> m_leader_lock->shut_down(ctx);
void NamespaceReplayer<I>::shut_down(Context *on_finish)
-> void NamespaceReplayer<I>::stop_instance_replayer()
-> void NamespaceReplayer<I>::handle_stop_instance_replayer(int r)
-> void NamespaceReplayer<I>::shut_down_instance_watcher()
-> void NamespaceReplayer<I>::handle_shut_down_instance_watcher(int r)
-> void NamespaceReplayer<I>::shut_down_instance_replayer()
-> void InstanceReplayer<I>::shut_down()
-> void InstanceReplayer<I>::cancel_image_state_check_task()
-> void InstanceReplayer<I>::wait_for_ops()
-> void InstanceReplayer<I>::handle_wait_for_ops(int r)
-> void InstanceReplayer<I>::stop_image_replayers()
...
-> void NamespaceReplayer<I>::handle_release_leader(Context *on_finish)
-> void NamespaceReplayer<I>::shut_down_image_deleter(Context* on_finish)
-> void NamespaceReplayer<I>::handle_shut_down_image_deleter(Context* on_finish)
-> void NamespaceReplayer<I>::shut_down_pool_watchers(Context *on_finish)
-> void NamespaceReplayer<I>::handle_shut_down_pool_watchers()
-> void NamespaceReplayer<I>::shut_down_image_map(Context *on_finish)
-> void NamespaceReplayer<I>::handle_shut_down_image_map()
-> void InstanceReplayer<I>::release_all(Context *on_finish)
-> void InstanceReplayer<I>::stop_image_replayer()
...
This is exactly where `ImageReplayer<I>::handle_shut_down()` finally reaches "stop complete", i.e. m_state = STATE_STOPPED, but by then it is too late: the pool replayer and several other components are already winding down.
2023-06-06T16:55:02.322+0530 7f1b019c4640 10 rbd::mirror::ImageReplayer: 0x5560178dfb00 [1/30158057-9bf1-49ca-8780-a4b2d0e502e2] handle_shut_down: stop complete
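Note that the cascade above is continuation-passing style: each handler fires a Context that kicks off the next stage, so the ImageReplayer's pending shutdown completes only deep inside the chain. A tiny sketch of that shape (simplified and synchronous, NOT Ceph code; Ceph's Context is a class, std::function stands in for it here):

// Tiny sketch of the continuation-passing shutdown cascade (NOT Ceph code).
#include <functional>
#include <iostream>

using Context = std::function<void(int)>;

// Each stage does its work and then fires on_finish, which chains the
// next stage -- so the ImageReplayer's pending shut_down only completes
// when stop_image_replayers() finally runs, deep inside the cascade.
void stop_image_replayers(Context on_finish) {
  std::cout << "ImageReplayer::handle_shut_down -> stop complete\n";
  on_finish(0);
}

void shut_down_instance_replayer(Context on_finish) {
  std::cout << "InstanceReplayer::shut_down\n";
  stop_image_replayers(std::move(on_finish));
}

void shut_down_namespace_replayer(Context on_finish) {
  std::cout << "NamespaceReplayer::shut_down\n";
  shut_down_instance_replayer(std::move(on_finish));
}

void shut_down_pool_replayer(Context on_finish) {
  std::cout << "PoolReplayer::shut_down\n";
  shut_down_namespace_replayer(std::move(on_finish));
}

int main() {
  shut_down_pool_replayer([](int r) {
    std::cout << "pool replayer shut down, r=" << r
              << "; only now is a re-init attempted\n";
  });
}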
The subsequent `pool_replayer->init(site_name)` then fails in init_rados() for the remote client:
void PoolReplayer<I>::init(const std::string& site_name)
-> int PoolReplayer<I>::init_rados()
-> r = (*rados_ref)->connect();
   if (r < 0) {
     derr << "error connecting to " << description << ": "
          << cpp_strerror(r) << dendl;
     return;
   }
2023-05-29T17:31:45.710+0530 7fd02101a840 -1 rbd::mirror::PoolReplayer: 0x55ac9ec5d180 init_rados: error connecting to remote peer uuid: 1ce59938-8e03-41f8-8d41-38aa51882026 cluster: f6852128-353e-4c71-8b2d-ab84c8a62229 client: client.rbd-mirror-peer: (110) Connection timed out
Hence it returns from there every time `Mirror::update_pool_replayers()` is invoked, probably forever (since I have destroyed site-a)! As a result, the PoolReplayer for the site-a peer is never started again.
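Put together, the restart path degenerates into the loop below (a simplified simulation, NOT Ceph code; the -110 stands for the ETIMEDOUT in the log above and is hard-coded because site-a no longer exists):

// Simplified simulation of the doomed restart loop (NOT Ceph code).
#include <chrono>
#include <iostream>
#include <thread>

using namespace std::chrono_literals;

// Stand-in for (*rados_ref)->connect() against the destroyed site-a:
// always times out.
int connect_to_remote() { return -110; /* ETIMEDOUT */ }

bool init_pool_replayer() {
  int r = connect_to_remote();
  if (r < 0) {
    std::cerr << "init_rados: error connecting to remote peer: " << r << "\n";
    return false;  // PoolReplayer::init() bails here
  }
  return true;
}

int main() {
  // Mimics Mirror::update_pool_replayers(): shut_down + init every
  // 30 seconds (rbd_mirror_pool_replayers_refresh_interval), and init
  // fails the same way every time.
  for (int tick = 0; tick < 3; ++tick) {
    std::cout << "restarting failed pool replayer\n";
    if (!init_pool_replayer()) {
      std::cout << "pool replayer still down; will retry next interval\n";
    }
    std::this_thread::sleep_for(1s);  // stands in for the 30-second interval
  }
}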
Scenario 2: The image cycles through various states but never reaches "description=local image is primary";
i.e. it flips between:
1. {state=up+stopped, description=force promoted}
2. {state=up+starting_replay, description=starting replay}
3. {state=up+stopped, description=stopped}
2023-06-06T15:25:55.423+0530 7f07e573c640 10 rbd::mirror::ImageReplayer: 0x5606d6f45b00 [1/86183050-c11a-4046-aadb-1668fad83e58] handle_shut_down: stop complete
2023-06-06T15:26:09.828+0530 7f07ef750640 10 rbd::mirror::InstanceReplayer: 0x5606d64ca820 queue_start_image_replayers:
2023-06-06T15:26:09.828+0530 7f07f2756640 10 rbd::mirror::InstanceReplayer: 0x5606d64ca820 start_image_replayers:
2023-06-06T15:26:09.828+0530 7f07f2756640 10 rbd::mirror::InstanceReplayer: 0x5606d64ca820 start_image_replayers: PRASANNAXXX: staring image replayer
2023-06-06T15:26:09.828+0530 7f07f2756640 10 rbd::mirror::InstanceReplayer: 0x5606d64ca820 start_image_replayer: global_image_id=86183050-c11a-4046-aadb-1668fad83e58
2023-06-06T15:26:09.828+0530 7f07f2756640 15 rbd::mirror::MirrorStatusUpdater 0x5606d14bb1e0 set_mirror_image_status: global_image_id=86183050-c11a-4046-aadb-1668fad83e58, mirror_image_site_status={state=up+starting_replay, description=starting replay, last_update=0.000000]}
[...]
2023-06-06T15:26:09.874+0530 7f07e573c640 5 rbd::mirror::PoolReplayer: 0x5606d14d3180 handle_remote_pool_meta_updated: remote_pool_meta=mirror_uuid=, mirror_peer_uuid=
2023-06-06T15:26:09.874+0530 7f07e573c640 -1 rbd::mirror::PoolReplayer: 0x5606d14d3180 handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly
In this scenario, `ImageReplayer<I>::handle_shut_down` reaches "stop complete" (i.e. m_state = STATE_STOPPED) even before `PoolReplayer<I>::handle_remote_pool_meta_updated` sets `m_stopping = true;`, so `update_pool_replayers` does not restart the pool replayer yet. As a result, the next
queue_start_image_replayers() -> start_image_replayers() -> start_image_replayer() actually succeeds and starts a new image replayer, which flips the status from {state=up+stopped, description=force promoted} to {state=up+starting_replay, description=starting replay}.
Once `PoolReplayer<I>::handle_remote_pool_meta_updated` sets `m_stopping = true;`, the upcoming `Mirror::update_pool_replayers()` tries to restart the PoolReplayer; as part of that it again invokes `ImageReplayer<I>::handle_shut_down`, which sets {state=up+stopped, description=stopped}.
After this, it ends up with init_rados() failures just like in the previous scenario.
Summary:
- It behaves this way because we do not have a way to order/synchronize the shutdowns of the various components (ImageReplayer, PoolReplayer, NamespaceReplayer, InstanceReplayer, RemotePoolPoller, etc.): once the PoolReplayer is told to shut down (as part of update_pool_replayers), we do not wait for anything; we simply invoke the ImageReplayer shutdown (and the other components' shutdowns). See the sketch after this list for one possible ordering approach.
- Also note that if `rbd_mirror_pool_replayers_refresh_interval` is tweaked to < 30 (the current default), the likelihood of hitting Scenario 1 increases.
- Right now, we do not clean up the remote client; instead, we fail in init_rados() for the remote site as part of the PoolReplayer initialization.
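For illustration only, one conceivable direction (a sketch under my own assumptions, NOT a proposed patch): make the restart path wait until every pending component shutdown has actually completed, e.g. with a C++20 std::latch-style gather, so init() can never race with a shut_down() that is still in flight:

// Illustrative sketch only (NOT a proposed patch): gate the restart on
// the completion of every pending shutdown using a latch-like gather.
// Build: g++ -std=c++20 -pthread ordered_shutdown_sketch.cpp
#include <chrono>
#include <iostream>
#include <latch>
#include <thread>
#include <vector>

using namespace std::chrono_literals;

int main() {
  // Pretend three components still have asynchronous shutdowns in flight.
  std::latch pending_shutdowns{3};
  std::vector<std::thread> workers;
  for (int i = 0; i < 3; ++i) {
    workers.emplace_back([i, &pending_shutdowns] {
      std::this_thread::sleep_for((i + 1) * 200ms);  // deferred completion
      std::cout << "component " << i << " shutdown complete\n";
      pending_shutdowns.count_down();
    });
  }

  // The restart path blocks until every shutdown has truly completed,
  // so it can never observe a half-stopped ImageReplayer.
  pending_shutdowns.wait();
  std::cout << "all shutdowns complete -> safe to re-init pool replayer\n";

  for (auto& w : workers) w.join();
}

In Ceph terms this would presumably be a gather Context rather than a latch, but the ordering guarantee is the point.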
Please find the log files for both cases attached.