Deepika, I am not sure about the described scenario. You wrote that the image was demoted (on the primary cluster). But was it promoted on the secondary?
Because this is what I see glancing through the log.
The part of the log you referred to:
2021-11-19T13:45:06.549+0000 7f1eccd83700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_local_mirror_snapshots:
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_local_mirror_snapshots: local mirror snapshot: id=9318, mirror_ns=[mirror state=primary, complete=1, mirror_peer_uuids=c3249fb1-852f-4253-8ffa-cc4117d17a21, primary_mirror_uuid=, primary_snap_id=head, last_copied_object_number=0, snap_seqs={}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_local_mirror_snapshots: local mirror snapshot: id=9336, mirror_ns=[mirror state=primary, complete=1, mirror_peer_uuids=c3249fb1-852f-4253-8ffa-cc4117d17a21, primary_mirror_uuid=, primary_snap_id=head, last_copied_object_number=0, snap_seqs={}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_local_mirror_snapshots: local mirror snapshot: id=9339, mirror_ns=[mirror state=primary (demoted), complete=1, mirror_peer_uuids=c3249fb1-852f-4253-8ffa-cc4117d17a21, primary_mirror_uuid=, primary_snap_id=head, last_copied_object_number=0, snap_seqs={}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_local_mirror_snapshots: found local mirror snapshot: local_snap_id_start=9339, local_snap_id_end=18446744073709551614, local_snap_ns=[mirror state=primary (demoted), complete=1, mirror_peer_uuids=c3249fb1-852f-4253-8ffa-cc4117d17a21, primary_mirror_uuid=, primary_snap_id=head, last_copied_object_number=0, snap_seqs={}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots:
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots: remote mirror snapshot: id=7448, mirror_ns=[mirror state=non-primary (demoted), complete=1, mirror_peer_uuids=e0e53715-2be7-46e3-8001-d8048321babc, primary_mirror_uuid=ad4d4f8b-4b43-49e3-8d83-768c763e9a4b, primary_snap_id=134d, last_copied_object_number=6144, snap_seqs={4941=18446744073709551614}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots: skipping remote snapshot 7448 while searching for demotion
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots: remote mirror snapshot: id=9315, mirror_ns=[mirror state=non-primary, complete=1, mirror_peer_uuids=, primary_mirror_uuid=ad4d4f8b-4b43-49e3-8d83-768c763e9a4b, primary_snap_id=2466, last_copied_object_number=6144, snap_seqs={9318=18446744073709551614}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots: skipping remote snapshot due to missing mirror peer
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snashots: remote mirror snapshot: id=9330, mirror_ns=[mirror state=non-primary, complete=0, mirror_peer_uuids=, primary_mirror_uuid=ad4d4f8b-4b43-49e3-8d83-768c763e9a4b, primary_snap_id=2478, last_copied_object_number=1934, snap_seqs={9336=18446744073709551614}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots: skipping remote snapshot due to missing mirror peer
2021-11-19T13:45:06.549+0000 7f1eccd83700 -1 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots: split-brain detected: failed to find matching non-primary snapshot in remote image: local_snap_id_start=9339, local_snap_ns=[mirror state=primary (demoted), complete=1, mirror_peer_uuids=c3249fb1-852f-4253-8ffa-cc4117d17a21, primary_mirror_uuid=, primary_snap_id=head, last_copied_object_number=0, snap_seqs={}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 notify_status_updated:
2021-11-19T13:45:06.549+0000 7f1ef2dcf700 10 rbd::mirror::ImageReplayer: 0x55dd9576ec80 [1/98f08d05-d04f-446c-8161-a0fec97ffe43] handle_replayer_notification:
2021-11-19T13:45:06.549+0000 7f1ef2dcf700 10 rbd::mirror::ImageReplayer: 0x55dd9576ec80 [1/98f08d05-d04f-446c-8161-a0fec97ffe43] handle_replayer_notification: replay interrupted: r=-17, error=split-brain
The local (cluster1, previously primary) image has older primary snapshots and then "primary(demoted)" and the last "non-primary (demoted)", which looks like ok and demotion was successful. But for the remote image (cluster1, previously secondary) I would expect to see the last snapshot in "primary" state, which would indicate the promotion succeeded, but actually I see only "non-primary" snapshots. The log message "skipping remote snapshot due to missing mirror peer" might be missliding but is expected here. As it is a non-primary snapshot it does not have mirror peers. They are set only for primary snapshot. Ideally, it should check if this is primary snapshot first to provide more meaningful log message, but this is just cosmetic.
So, I believe at this time the image is demoted on the primary and not promoted yet on the secondary and the ImageReplayer failed with "split-brain" error. But this is expected. In 30 sec it tries to bootstrap again (in case the image is promoted already). And for this try we see in the logs:
2021-11-19T13:45:36.074+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::PrepareReplayRequest: 0x55dd958d52c0 load_local_image_meta:
2021-11-19T13:45:36.074+0000 7f1ef25ce700 15 rbd::mirror::ImageReplayer: 0x55dd9576ec80 [1/98f08d05-d04f-446c-8161-a0fec97ffe43] set_mirror_image_status_update: force=0, state=--
2021-11-19T13:45:36.074+0000 7f1ef25ce700 15 rbd::mirror::ImageReplayer: 0x55dd9576ec80 [1/98f08d05-d04f-446c-8161-a0fec97ffe43] set_mirror_image_status_update: status={state=up+error, descr
iption=starting replay, last_update=0.000000]}
2021-11-19T13:45:36.074+0000 7f1ef25ce700 15 rbd::mirror::MirrorStatusUpdater 0x55dd939b30e0 set_mirror_image_status: global_image_id=98f08d05-d04f-446c-8161-a0fec97ffe43, mirror_image_site_
status={state=up+error, description=starting replay, last_update=0.000000]}
2021-11-19T13:45:36.074+0000 7f1ef25ce700 15 rbd::mirror::MirrorStatusUpdater 0x55dd918546c0 set_mirror_image_status: global_image_id=98f08d05-d04f-446c-8161-a0fec97ffe43, mirror_image_site_
status={state=up+error, description=starting replay, last_update=0.000000]}
2021-11-19T13:45:36.075+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::PrepareReplayRequest: 0x55dd958d52c0 handle_load_local_image_meta: r=0
2021-11-19T13:45:36.075+0000 7f1eccd83700 10 rbd::mirror::image_replayer::BootstrapRequest: 0x55dd9164a820 handle_prepare_replay: r=0
2021-11-19T13:45:36.075+0000 7f1eccd83700 10 rbd::mirror::image_replayer::BootstrapRequest: 0x55dd9164a820 handle_prepare_replay: local image resync requested
It it detects that the image resync was requested. This makes the ImageReplayer stop, remove (move to the trash) the local image and start ImageReplayer again, so the bootstrap would start the resync. But this time it (expectedly) detects that the remote image is non-primary:
2021-11-19T13:46:06.032+0000 7f1ec6576700 10 rbd::mirror::image_replayer::BootstrapRequest: 0x55dd8ff37380 handle_prepare_remote_image: r=-121
2021-11-19T13:46:06.032+0000 7f1ec6576700 10 rbd::mirror::image_replayer::BootstrapRequest: 0x55dd8ff37380 handle_prepare_remote_image: remote-image is non-primary(121) Remote I/O error
2021-11-19T13:46:06.032+0000 7f1ec6576700 10 rbd::mirror::ImageReplayer: 0x55dd9576ec80 [1/98f08d05-d04f-446c-8161-a0fec97ffe43] handle_bootstrap: r=-121
2021-11-19T13:46:06.032+0000 7f1ec6576700 5 rbd::mirror::ImageReplayer: 0x55dd9576ec80 [1/98f08d05-d04f-446c-8161-a0fec97ffe43] handle_bootstrap: remote image is non-primary
2021-11-19T13:46:06.032+0000 7f1ec6576700 10 rbd::mirror::ImageReplayer: 0x55dd9576ec80 [1/98f08d05-d04f-446c-8161-a0fec97ffe43] on_start_fail: r=-121, desc=remote image is non-primary
And sure it will fail.
So to me from the log it looks like someone demoted the primary image, and, instead of promoting it on the secondary, requested resync. And the resync in this situation expectedly failed (no primary). I might be missing something and the real scenario was different, that is why I am interested to learn it.
If it was indeed this scenario (resync after demote), then the test case looks wrong. The rbd-mirror might be improved though not to try resync (deleting the local image) if the remote is not primary, but this is a different story.