Bug #54098 (closed)

[rbd-mirror] hangs forever after split brain, while searching for demotion snapshot

Added by Deepika Upadhyay about 2 years ago. Updated almost 2 years ago.

Status: Duplicate
Priority: Normal
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

This is not readily reproducible in a Ceph dev environment, but these are the operations that lead to this condition:

- perform an IO workload for a substantial time (~1 hr) on the primary cluster c1
- while the workload is running, fail over to the secondary cluster c2; demote the image (see the command sketch below)
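
For concreteness, a rough sketch of this sequence with the rbd CLI (the cluster, pool and image names are placeholders, and "rbd bench" only stands in for whatever workload generator was actually used):

# run an IO workload against the image on the primary cluster c1 for ~1 hour
rbd --cluster c1 bench --io-type write --io-size 4K --io-total 10G mypool/myimage

# while the workload is still running, demote the image on c1 as part of the failover;
# a normal failover would then promote the image on the secondary cluster c2
rbd --cluster c1 mirror image demote mypool/myimage
rbd --cluster c2 mirror image promote mypool/myimage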

This leads to split-brain:

2021-11-19T13:45:06.549+0000 7f1eccd83700 -1 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots: split-brain detected: failed to find matching non-primary snapshot in remote image: local_snap_id_start=9339, local_snap_ns=[mirror state=primary (demoted), complete=1, mirror_peer_uuids=c3249fb1-852f-4253-8ffa-cc4117d17a21, primary_mirror_uuid=, primary_snap_id=head, last_copied_object_number=0, snap_seqs={}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 notify_status_updated:

The replayer then keeps skipping remote snapshots while searching for a demotion snapshot that is never found:

2021-11-19T15:19:06.108+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd95eb6400 scan_remote_mirror_snapshots: remote mirror snapshot: id=7451, mirror_ns=[mirror state=non-primary (demoted), complete=1, mirror_peer_uuids=e0e53715-2be7-46e3-8001-d8048321babc, primary_mirror_uuid=ad4d4f8b-4b43-49e3-8d83-768c763e9a4b, primary_snap_id=1359, last_copied_object_number=12800, snap_seqs={4953=18446744073709551614}]
2021-11-19T15:19:06.108+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd95eb6400 scan_remote_mirror_snapshots: skipping remote snapshot 7451 while searching for demotion

complete logs: https://drive.google.com/file/d/1FfVT84daBKZgXJkwuWbhDUZQzEiaI8pU/view?usp=sharing


Related issues (0 open, 1 closed)

Is duplicate of rbd - Bug #54448: [rbd-mirror] "failed to unlink local peer from remote image" due to EROFS error (Resolved, Ilya Dryomov)

Actions #1

Updated by Mykola Golub about 2 years ago

Deepika, I am not sure about the described scenario. You wrote that the image was demoted (on the primary cluster). But was it promoted on the secondary?

Because this is what I see glancing through the log.

The part of the log you referred to:

2021-11-19T13:45:06.549+0000 7f1eccd83700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_local_mirror_snapshots: 
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_local_mirror_snapshots: local mirror snapshot: id=9318, mirror_ns=[mirror state=primary, complete=1, mirror_peer_uuids=c3249fb1-852f-4253-8ffa-cc4117d17a21, primary_mirror_uuid=, primary_snap_id=head, last_copied_object_number=0, snap_seqs={}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_local_mirror_snapshots: local mirror snapshot: id=9336, mirror_ns=[mirror state=primary, complete=1, mirror_peer_uuids=c3249fb1-852f-4253-8ffa-cc4117d17a21, primary_mirror_uuid=, primary_snap_id=head, last_copied_object_number=0, snap_seqs={}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_local_mirror_snapshots: local mirror snapshot: id=9339, mirror_ns=[mirror state=primary (demoted), complete=1, mirror_peer_uuids=c3249fb1-852f-4253-8ffa-cc4117d17a21, primary_mirror_uuid=, primary_snap_id=head, last_copied_object_number=0, snap_seqs={}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_local_mirror_snapshots: found local mirror snapshot: local_snap_id_start=9339, local_snap_id_end=18446744073709551614, local_snap_ns=[mirror state=primary (demoted), complete=1, mirror_peer_uuids=c3249fb1-852f-4253-8ffa-cc4117d17a21, primary_mirror_uuid=, primary_snap_id=head, last_copied_object_number=0, snap_seqs={}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots: 
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots: remote mirror snapshot: id=7448, mirror_ns=[mirror state=non-primary (demoted), complete=1, mirror_peer_uuids=e0e53715-2be7-46e3-8001-d8048321babc, primary_mirror_uuid=ad4d4f8b-4b43-49e3-8d83-768c763e9a4b, primary_snap_id=134d, last_copied_object_number=6144, snap_seqs={4941=18446744073709551614}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots: skipping remote snapshot 7448 while searching for demotion
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots: remote mirror snapshot: id=9315, mirror_ns=[mirror state=non-primary, complete=1, mirror_peer_uuids=, primary_mirror_uuid=ad4d4f8b-4b43-49e3-8d83-768c763e9a4b, primary_snap_id=2466, last_copied_object_number=6144, snap_seqs={9318=18446744073709551614}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots: skipping remote snapshot due to missing mirror peer
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots: remote mirror snapshot: id=9330, mirror_ns=[mirror state=non-primary, complete=0, mirror_peer_uuids=, primary_mirror_uuid=ad4d4f8b-4b43-49e3-8d83-768c763e9a4b, primary_snap_id=2478, last_copied_object_number=1934, snap_seqs={9336=18446744073709551614}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots: skipping remote snapshot due to missing mirror peer
2021-11-19T13:45:06.549+0000 7f1eccd83700 -1 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 scan_remote_mirror_snapshots: split-brain detected: failed to find matching non-primary snapshot in remote image: local_snap_id_start=9339, local_snap_ns=[mirror state=primary (demoted), complete=1, mirror_peer_uuids=c3249fb1-852f-4253-8ffa-cc4117d17a21, primary_mirror_uuid=, primary_snap_id=head, last_copied_object_number=0, snap_seqs={}]
2021-11-19T13:45:06.549+0000 7f1eccd83700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55dd936b6000 notify_status_updated: 
2021-11-19T13:45:06.549+0000 7f1ef2dcf700 10 rbd::mirror::ImageReplayer: 0x55dd9576ec80 [1/98f08d05-d04f-446c-8161-a0fec97ffe43] handle_replayer_notification: 
2021-11-19T13:45:06.549+0000 7f1ef2dcf700 10 rbd::mirror::ImageReplayer: 0x55dd9576ec80 [1/98f08d05-d04f-446c-8161-a0fec97ffe43] handle_replayer_notification: replay interrupted: r=-17, error=split-brain

The local (cluster1, previously primary) image has older primary snapshots, then a "primary (demoted)" one and the last "non-primary (demoted)" one, which looks OK and indicates the demotion was successful. But for the remote image (cluster2, previously secondary) I would expect to see the last snapshot in the "primary" state, which would indicate the promotion succeeded; instead I see only "non-primary" snapshots. The log message "skipping remote snapshot due to missing mirror peer" might be misleading but is expected here: as it is a non-primary snapshot, it does not have mirror peers, which are set only for primary snapshots. Ideally, it should check whether the snapshot is primary first to provide a more meaningful log message, but this is just cosmetic.
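
For reference, the mirror-namespace snapshot states being scanned here can also be inspected directly with the rbd CLI (pool and image names below are placeholders); this is just one way to double-check what the log shows:

# list snapshots in all namespaces; mirror snapshots report their state
# (primary, primary (demoted), non-primary, ...) in the namespace column
rbd --cluster c1 snap ls --all mypool/myimage    # local image (previously primary)
rbd --cluster c2 snap ls --all mypool/myimage    # remote image (previously secondary)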

So I believe that at this point the image is demoted on the primary and not yet promoted on the secondary, and the ImageReplayer failed with the "split-brain" error. But this is expected. In 30 seconds it tries to bootstrap again (in case the image has been promoted by then). And for this try we see in the logs:

2021-11-19T13:45:36.074+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::PrepareReplayRequest: 0x55dd958d52c0 load_local_image_meta: 
2021-11-19T13:45:36.074+0000 7f1ef25ce700 15 rbd::mirror::ImageReplayer: 0x55dd9576ec80 [1/98f08d05-d04f-446c-8161-a0fec97ffe43] set_mirror_image_status_update: force=0, state=--
2021-11-19T13:45:36.074+0000 7f1ef25ce700 15 rbd::mirror::ImageReplayer: 0x55dd9576ec80 [1/98f08d05-d04f-446c-8161-a0fec97ffe43] set_mirror_image_status_update: status={state=up+error, description=starting replay, last_update=0.000000]}
2021-11-19T13:45:36.074+0000 7f1ef25ce700 15 rbd::mirror::MirrorStatusUpdater 0x55dd939b30e0 set_mirror_image_status: global_image_id=98f08d05-d04f-446c-8161-a0fec97ffe43, mirror_image_site_status={state=up+error, description=starting replay, last_update=0.000000]}
2021-11-19T13:45:36.074+0000 7f1ef25ce700 15 rbd::mirror::MirrorStatusUpdater 0x55dd918546c0 set_mirror_image_status: global_image_id=98f08d05-d04f-446c-8161-a0fec97ffe43, mirror_image_site_status={state=up+error, description=starting replay, last_update=0.000000]}
2021-11-19T13:45:36.075+0000 7f1eccd83700 15 rbd::mirror::image_replayer::snapshot::PrepareReplayRequest: 0x55dd958d52c0 handle_load_local_image_meta: r=0
2021-11-19T13:45:36.075+0000 7f1eccd83700 10 rbd::mirror::image_replayer::BootstrapRequest: 0x55dd9164a820 handle_prepare_replay: r=0
2021-11-19T13:45:36.075+0000 7f1eccd83700 10 rbd::mirror::image_replayer::BootstrapRequest: 0x55dd9164a820 handle_prepare_replay: local image resync requested

Then it detects that an image resync was requested. This makes the ImageReplayer stop, remove (move to the trash) the local image, and start the ImageReplayer again, so that the bootstrap would start the resync. But this time it (expectedly) detects that the remote image is non-primary:

2021-11-19T13:46:06.032+0000 7f1ec6576700 10 rbd::mirror::image_replayer::BootstrapRequest: 0x55dd8ff37380 handle_prepare_remote_image: r=-121
2021-11-19T13:46:06.032+0000 7f1ec6576700 10 rbd::mirror::image_replayer::BootstrapRequest: 0x55dd8ff37380 handle_prepare_remote_image: remote-image is non-primary(121) Remote I/O error
2021-11-19T13:46:06.032+0000 7f1ec6576700 10 rbd::mirror::ImageReplayer: 0x55dd9576ec80 [1/98f08d05-d04f-446c-8161-a0fec97ffe43] handle_bootstrap: r=-121
2021-11-19T13:46:06.032+0000 7f1ec6576700  5 rbd::mirror::ImageReplayer: 0x55dd9576ec80 [1/98f08d05-d04f-446c-8161-a0fec97ffe43] handle_bootstrap: remote image is non-primary
2021-11-19T13:46:06.032+0000 7f1ec6576700 10 rbd::mirror::ImageReplayer: 0x55dd9576ec80 [1/98f08d05-d04f-446c-8161-a0fec97ffe43] on_start_fail: r=-121, desc=remote image is non-primary

And sure enough, it fails.

So to me, from the log it looks like someone demoted the primary image and, instead of promoting it on the secondary, requested a resync. And the resync in this situation expectedly failed (there was no primary). I might be missing something and the real scenario may have been different, which is why I am interested to learn what it was.

If it was indeed this scenario (resync after demote), then the test case looks wrong. rbd-mirror might still be improved not to attempt the resync (which deletes the local image) when the remote is not primary, but that is a different story.
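
For clarity, a minimal sketch of the two orderings discussed above, with placeholder pool/image names (not taken from the logs, just an illustration):

# what the logs suggest happened: resync requested while no site is primary
rbd --cluster c1 mirror image demote mypool/myimage
rbd --cluster c1 mirror image resync mypool/myimage    # expectedly fails: remote image is non-primary

# the ordering expected to work: promote on the other site first, then resync the demoted copy
rbd --cluster c1 mirror image demote mypool/myimage
rbd --cluster c2 mirror image promote mypool/myimage
rbd --cluster c1 mirror image resync mypool/myimage    # normally not needed at all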

Actions #2

Updated by Deepika Upadhyay about 2 years ago

@Mykola, thanks. Yes, this test case was a bit different from the local issue I observed:

- Workload was running on C1
- Workload was failed over (force promoted on C2)
- C1 issued a resync (RBD images on C1 were demoted and a resync was issued)
- As a result, the images were detected to be in a split-brain situation (due to the above force promotion)

I suppose they are manually issuing a resync after demotion:
https://bugzilla.redhat.com/show_bug.cgi?id=2019931#c2

The user requested the workload to be relocated to the preferred cluster. In summary, during the relocation, the VR is requested by VRG to move VR state to secondary.  During that process, we demote the image, we resync if necessary, and then report success.

This was observed when doing failback and failover, but I am not sure whether the correct order was followed. I will check with them.

Actions #3

Updated by Mykola Golub about 2 years ago

So, I think they just did it wrong: resync after demote. I suppose they just forgot to promote (or were not aware this step was necessary). I expect resync to work after promote, though normally it is not needed at all.

It looks like, if the image is in the non-primary state on both clusters, rbd-mirror will register a split-brain status, which is resolved automatically when the image is promoted on one of the sites. Note, I have not tried to reproduce this to be 100% sure about the behaviour; this is what I see from their logs and the code. This is probably something that could be improved (i.e. do not report a split-brain error in this situation), but it is rather a minor (cosmetic) issue that does not affect functionality.
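
As a small illustration of that (again only a sketch with placeholder names): the split-brain status reported per image can be checked on either site, and promoting the image on one of the sites is what should let replay resume automatically:

# check the reported mirroring status; the description field would carry the split-brain error
rbd --cluster c1 mirror image status mypool/myimage
rbd --cluster c2 mirror image status mypool/myimage

# promoting on one site should clear the split-brain status once the replayer retries
rbd --cluster c2 mirror image promote mypool/myimage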

Actions #4

Updated by Deepika Upadhyay about 2 years ago

  • Pull request ID set to 44957
Actions #5

Updated by Deepika Upadhyay about 2 years ago

  • Status changed from New to Fix Under Review
Actions #6

Updated by Christopher Hoffman about 2 years ago

  • Assignee set to Deepika Upadhyay
Actions #7

Updated by Ilya Dryomov almost 2 years ago

Mykola Golub wrote:

So, I think they just did it wrong: resync after demote. I suppose they just forgot to promote (or were not aware this step was necessary). I expect resync to work after promote, though normally it is not needed at all.

With https://tracker.ceph.com/issues/54448, such dangerous resyncs are effectively postponed until one of the images is promoted.

It looks like, if the image is in the non-primary state on both clusters, rbd-mirror will register a split-brain status, which is resolved automatically when the image is promoted on one of the sites. Note, I have not tried to reproduce this to be 100% sure about the behaviour; this is what I see from their logs and the code. This is probably something that could be improved (i.e. do not report a split-brain error in this situation), but it is rather a minor (cosmetic) issue that does not affect functionality.

Reporting split-brain in this case is indeed undesirable. With https://tracker.ceph.com/issues/54448, "remote image is not primary" is reported for both images.

Actions #8

Updated by Ilya Dryomov almost 2 years ago

  • Is duplicate of Bug #54448: [rbd-mirror] "failed to unlink local peer from remote image" due to EROFS error added
Actions #9

Updated by Ilya Dryomov almost 2 years ago

  • Status changed from Fix Under Review to Duplicate
  • Pull request ID deleted (44957)