Bug #54258
closedqa/workunits/rbd/rbd_mirror_stress.sh: rbd: mirroring not enabled on the image
0%
Description
rbd_mirror_stress.sh sporadically fails on master like in this case:
2022-02-11T11:01:18.039 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:+ CEPH_ARGS= rbd --cluster cluster1 mirror image status mirror/test 2022-02-11T11:01:18.040 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:+ tee /tmp/tmp.XK426RdY94/cluster1-mirror-test.mirror_status 2022-02-11T11:01:18.095 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:rbd: mirroring not enabled on the image 2022-02-11T11:01:18.104 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:+ grep ^ state: .*up+replaying /tmp/tmp.XK426RdY94/cluster1-mirror-test.mirror_status 2022-02-11T11:01:18.106 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:+ return 1 2022-02-11T11:01:18.107 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:+ return 1 2022-02-11T11:01:18.107 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:+ cleanup 1
I.e. the test tries to to get the mirroring non-primary image status and can't find the status (in cluster1).
And when we look at the test diagnostic output, we see for "mirror status" in cluster1:
2022-02-11T11:01:19.467 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:test: 2022-02-11T11:01:19.467 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: global_id: c7006e3b-7e00-4cf0-bb62-e0bad69b2003 2022-02-11T11:01:19.467 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: state: down+unknown 2022-02-11T11:01:19.467 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: description: status not found 2022-02-11T11:01:19.468 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: last_update: 2022-02-11T11:01:19.468 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: peer_sites: 2022-02-11T11:01:19.468 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: name: cluster2 2022-02-11T11:01:19.468 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: state: down+unknown 2022-02-11T11:01:19.468 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: description: status not found 2022-02-11T11:01:19.469 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: last_update: 2022-02-11T11:01:19.471 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:+ echo
While in cluster2 it is as expected:
2022-02-11T11:01:22.497 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:test: 2022-02-11T11:01:22.497 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: global_id: c7006e3b-7e00-4cf0-bb62-e0bad69b2003 2022-02-11T11:01:22.498 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: state: up+stopped 2022-02-11T11:01:22.498 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: description: local image is primary 2022-02-11T11:01:22.498 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: service: mirror.0 on smithi106 2022-02-11T11:01:22.498 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: last_update: 2022-02-11 11:01:16 2022-02-11T11:01:22.499 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: peer_sites: 2022-02-11T11:01:22.499 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: name: cluster1 2022-02-11T11:01:22.499 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: state: up+replaying 2022-02-11T11:01:22.499 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"remote_snapshot_timestamp":1644577159,"replay_state":"idle"} 2022-02-11T11:01:22.500 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout: last_update: 2022-02-11 11:01:16
Updated by Mykola Golub about 2 years ago
In the rbd-mirror log I see that the both local and remote status updaters send the status:
2022-02-11T11:01:11.623+0000 7f10aa7fd700 15 rbd::mirror::ImageReplayer: 0x55aeefd1c280 [3/c7006e3b-7e00-4cf0-bb62-e0bad69b2003] set_mirror_image_status_update: status={state=up+replaying, description=replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"remote_snapshot_timestamp":1644577159,"replay_state":"idle"}, last_update=0.000000]} 2022-02-11T11:01:11.623+0000 7f10aa7fd700 15 rbd::mirror::MirrorStatusUpdater 0x55aef09aac60 set_mirror_image_status: global_image_id=c7006e3b-7e00-4cf0-bb62-e0bad69b2003, mirror_image_site_status={state=up+replaying, description=replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"remote_snapshot_timestamp":1644577159,"replay_state":"idle"}, last_update=0.000000]} 2022-02-11T11:01:11.623+0000 7f10aa7fd700 15 rbd::mirror::MirrorStatusUpdater 0x55aeef116000 set_mirror_image_status: global_image_id=c7006e3b-7e00-4cf0-bb62-e0bad69b2003, mirror_image_site_status={state=up+replaying, description=replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"remote_snapshot_timestamp":1644577159,"replay_state":"idle"}, last_update=0.000000]} 2022-02-11T11:01:16.019+0000 7f10a6ff6700 10 rbd::mirror::MirrorStatusUpdater 0x55aef09aac60 handle_timer_task: 2022-02-11T11:01:16.019+0000 7f10a6ff6700 10 rbd::mirror::MirrorStatusUpdater 0x55aef09aac60 schedule_timer_task: 2022-02-11T11:01:16.019+0000 7f10a6ff6700 10 rbd::mirror::MirrorStatusUpdater 0x55aef09aac60 queue_update_task: 2022-02-11T11:01:16.019+0000 7f10a9ffc700 10 rbd::mirror::MirrorStatusUpdater 0x55aef09aac60 update_task: 2022-02-11T11:01:16.023+0000 7f10a37ef700 10 rbd::mirror::MirrorStatusUpdater 0x55aef09aac60 handle_update_task: 2022-02-11T11:01:16.023+0000 7f10a6ff6700 10 rbd::mirror::MirrorStatusUpdater 0x55aeef116000 handle_timer_task: 2022-02-11T11:01:16.023+0000 7f10a6ff6700 10 rbd::mirror::MirrorStatusUpdater 0x55aeef116000 schedule_timer_task: 2022-02-11T11:01:16.023+0000 7f10a6ff6700 10 rbd::mirror::MirrorStatusUpdater 0x55aeef116000 queue_update_task: 2022-02-11T11:01:16.023+0000 7f10aa7fd700 10 rbd::mirror::MirrorStatusUpdater 0x55aeef116000 update_task: 2022-02-11T11:01:16.027+0000 7f109cfe2700 10 rbd::mirror::MirrorStatusUpdater 0x55aeef116000 handle_update_task:
But from `rados --cluster cluster1 -p mirror --namespace listomapvals rbd_mirroring` one can see that is not stored in the local (cluster1) by some reason.
Updated by Mykola Golub about 2 years ago
- Status changed from New to In Progress
- Assignee set to Mykola Golub
Updated by Mykola Golub about 2 years ago
The problem seems to be caused by a race when a peer is registered both by the rbd-mirror "status ping" and the test setup script. After [1] was fixed the test setup always wins, but the rbd-mirror may still use in-memory (outdated) mirror uuid, and the replayer skips mirroring snapshots and does not update the image mirroring state to "ENABLED".
Below are details based on the case [2].
For snapshot based mirroring the "enable" request set the image state to "enabling" state and a mirroring snapshot is created. It is expected that when the replayer is run it will detect the primary mirroring snapshot, create the corresponding non-primary snapshot and update the mirroring state to "enabled" (see rbd::mirror::image_replayer::snapshot::Replayer::handle_update_mirror_image_state). But in the test case we see:
2022-02-12T19:05:17.776+0000 7f51bbee7700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x563ff1715400 handle_load_local_image_meta: r=-2 2022-02-12T19:05:17.776+0000 7f51bbee7700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x563ff1715400 scan_local_mirror_snapshots: 2022-02-12T19:05:17.776+0000 7f51bbee7700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x563ff1715400 scan_remote_mirror_snapshots: 2022-02-12T19:05:17.776+0000 7f51bbee7700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x563ff1715400 scan_remote_mirror_snapshots: remote mirror snapshot: id=4, mirror_ns=[mirror state=primary, complete=1, mirror_peer_uuids=b6da8fb9-9aa2-4977-be4e-1f70853a9f10, primary_mirror_uuid=, primary_snap_id=head, last_copied_object_number=0, snap_seqs={}] 2022-02-12T19:05:17.776+0000 7f51bbee7700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x563ff1715400 scan_remote_mirror_snapshots: skipping remote snapshot due to missing mirror peer 2022-02-12T19:05:17.776+0000 7f51bbee7700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x563ff1715400 scan_remote_mirror_snapshots: all remote snapshots synced: idling waiting for new snapshot 2022-02-12T19:05:17.776+0000 7f51bbee7700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x563ff1715400 notify_status_updated:
I.e. the primary snapshot was skipped because the peer uuid b6da8fb9-9aa2-4977-be4e-1f70853a9f10 did not match what the rbd-mirror was thinking its own uuid was.
And if we look at the test log when it was configuring the peer, we can see:
2022-02-12T19:05:12.352 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ rbd --cluster cluster2 mirror pool peer add mirror cluster1 2022-02-12T19:05:12.385 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:rbd: mirror peer already exists 2022-02-12T19:05:12.388 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ peer_uuid= 2022-02-12T19:05:12.389 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ sleep 1 2022-02-12T19:05:13.391 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ rbd mirror pool info --cluster cluster2 --pool mirror --format xml 2022-02-12T19:05:13.391 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ xmlstarlet sel -t -v //peers/peer[site_name='cluster1']/uuid 2022-02-12T19:05:13.447 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ peer_uuid=d1d0f004-da17-4438-89d1-c516656e2382 2022-02-12T19:05:13.447 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ CEPH_ARGS= rbd --cluster cluster2 --pool mirror mirror pool peer remove d1d0f004-da17-4438-89d1-c516656e2382 2022-02-12T19:05:13.503 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ set +e 2022-02-12T19:05:13.503 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ rbd --cluster cluster2 mirror pool peer add mirror cluster1 2022-02-12T19:05:13.544 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ peer_uuid=b6da8fb9-9aa2-4977-be4e-1f70853a9f10
I.e. after enabling mirroring, when adding a peer, it detected that peer had been already created automatically by rbd-mirror "status ping" and the test recreated it because it was tx-only (see [1] for more details about description of this case). After setting the peer we do not send any notifications so rbd-mirror could detect metadata change, and it still uses the old peer. Only after the test failed we may see that the rbd-mirror eventually detected it:
2022-02-12T19:05:42.179 INFO:tasks.rbd_mirror.cluster1.client.mirror.0.smithi149.stderr:2022-02-12T19:05:42.172+0000 7f3d01513700 -1 rbd::mirror::PoolReplayer: 0x55eb555e4c00 handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly 2022-02-12T19:05:42.214 INFO:tasks.rbd_mirror.cluster1.client.mirror.0.smithi149.stderr:2022-02-12T19:05:42.208+0000 7f3d1ed91680 -1 rbd::mirror::Mirror: 0x55eb53155c20 update_pool_replayers: restarting failed pool replayer for uuid: e0c8605b-d07c-4463-bc90-8dfbf537c0b0 cluster: cluster2 client: client.mirror 2022-02-12T19:05:42.249 INFO:tasks.rbd_mirror.cluster1.client.mirror.1.smithi149.stderr:2022-02-12T19:05:42.244+0000 7f51b56da700 -1 rbd::mirror::PoolReplayer: 0x563fed9d3b00 handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly 2022-02-12T19:05:42.352 INFO:tasks.rbd_mirror.cluster1.client.mirror.1.smithi149.stderr:2022-02-12T19:05:42.348+0000 7f51a86c0700 -1 rbd::mirror::PoolReplayer: 0x563fefda0f00 handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly 2022-02-12T19:05:43.405 INFO:tasks.rbd_mirror.cluster1.client.mirror.2.smithi149.stderr:2022-02-12T19:05:43.400+0000 7f4b591ed700 -1 rbd::mirror::PoolReplayer: 0x563828f07800 handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly 2022-02-12T19:05:43.428 INFO:tasks.rbd_mirror.cluster1.client.mirror.3.smithi149.stderr:2022-02-12T19:05:43.424+0000 7f59250fd700 -1 rbd::mirror::PoolReplayer: 0x556b6ee3fb00 handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly 2022-02-12T19:05:43.514 INFO:tasks.rbd_mirror.cluster1.client.mirror.2.smithi149.stderr:2022-02-12T19:05:43.508+0000 7f4b4b1d1700 -1 rbd::mirror::PoolReplayer: 0x56382b357200 handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly 2022-02-12T19:05:43.549 INFO:tasks.rbd_mirror.cluster1.client.mirror.3.smithi149.stderr:2022-02-12T19:05:43.544+0000 7f59178e2700 -1 rbd::mirror::PoolReplayer: 0x556b71218f00 handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly
I suppose the problem was not seen previously because before [3] was merged we allowed to set mirroring image status even if it was not in "enabled" state.
[1] https://tracker.ceph.com/issues/44938
[2] http://qa-proxy.ceph.com/teuthology/trociny-2022-02-12_18:36:33-rbd-wip-mgolub-testing-distro-basic-smithi/6678662/
[3]https://github.com/ceph/ceph/pull/41696
Updated by Mykola Golub about 2 years ago
Digging it further, the problem is not that the rbd-mirror detected the peer change too late. Actually, after detecting it ("handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly") it had enough time to restart the pool replayer before the test aborted, but it did not do this:
When a PoolReplayer detects remote pool metadata change it sets "stopping" flag expecting the Mirror will restart it. Although setting "stopping" flag makes the PoolReplayer::run thread to terminate, the thread's is_started function will still return true until join is called (and reset the thread id). This made impossible for the Mirror to detect (by calling PoolReplayer::is_running) that the PoolReplayer needed restart.
Updated by Mykola Golub about 2 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 45086
Updated by Ilya Dryomov about 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 2 years ago
- Copied to Backport #54381: pacific: qa/workunits/rbd/rbd_mirror_stress.sh: rbd: mirroring not enabled on the image added
Updated by Backport Bot about 2 years ago
- Copied to Backport #54382: octopus: qa/workunits/rbd/rbd_mirror_stress.sh: rbd: mirroring not enabled on the image added
Updated by Ilya Dryomov almost 2 years ago
- Status changed from Pending Backport to Resolved