Bug #54258

closed

qa/workunits/rbd/rbd_mirror_stress.sh: rbd: mirroring not enabled on the image

Added by Mykola Golub about 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
octopus,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

rbd_mirror_stress.sh sporadically fails on master like in this case:

http://qa-proxy.ceph.com/teuthology/trociny-2022-02-11_10:28:15-rbd-master-distro-basic-smithi/6676388/teuthology.log

2022-02-11T11:01:18.039 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:+ CEPH_ARGS= rbd --cluster cluster1 mirror image status mirror/test
2022-02-11T11:01:18.040 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:+ tee /tmp/tmp.XK426RdY94/cluster1-mirror-test.mirror_status
2022-02-11T11:01:18.095 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:rbd: mirroring not enabled on the image
2022-02-11T11:01:18.104 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:+ grep ^  state: .*up+replaying /tmp/tmp.XK426RdY94/cluster1-mirror-test.mirror_status
2022-02-11T11:01:18.106 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:+ return 1
2022-02-11T11:01:18.107 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:+ return 1
2022-02-11T11:01:18.107 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:+ cleanup 1

I.e. the test tries to get the mirroring status of the non-primary image (in cluster1) and can't find it.
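
The check the workunit performs is essentially a grep over `rbd mirror image status` output. A flakiness-tolerant version of such a check is a retry loop; the sketch below uses a hypothetical `wait_for_state` helper (not the workunit's actual helper functions) to illustrate the pattern:

```shell
#!/usr/bin/env bash
# Sketch of a retry loop around a status check. wait_for_state and its
# arguments are hypothetical, not the workunit's real helpers.
wait_for_state() {
    local cmd=$1 pattern=$2 tries=${3:-30}
    for ((i = 0; i < tries; i++)); do
        # re-run the status command and look for the expected state line
        if $cmd | grep -q "$pattern"; then
            return 0
        fi
        sleep 1
    done
    return 1
}

# Against a real cluster the call would look something like:
# wait_for_state "rbd --cluster cluster1 mirror image status mirror/test" \
#     '^  state: .*up+replaying'
```

Note this only papers over transient delays; it would not have helped here, since the status never appeared in cluster1 at all.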

And when we look at the test diagnostic output, we see the following "mirror status" in cluster1:

2022-02-11T11:01:19.467 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:test:
2022-02-11T11:01:19.467 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:  global_id:   c7006e3b-7e00-4cf0-bb62-e0bad69b2003
2022-02-11T11:01:19.467 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:  state:       down+unknown
2022-02-11T11:01:19.467 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:  description: status not found
2022-02-11T11:01:19.468 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:  last_update:
2022-02-11T11:01:19.468 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:  peer_sites:
2022-02-11T11:01:19.468 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:    name: cluster2
2022-02-11T11:01:19.468 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:    state: down+unknown
2022-02-11T11:01:19.468 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:    description: status not found
2022-02-11T11:01:19.469 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:    last_update:
2022-02-11T11:01:19.471 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stderr:+ echo

While in cluster2 it is as expected:

2022-02-11T11:01:22.497 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:test:
2022-02-11T11:01:22.497 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:  global_id:   c7006e3b-7e00-4cf0-bb62-e0bad69b2003
2022-02-11T11:01:22.498 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:  state:       up+stopped
2022-02-11T11:01:22.498 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:  description: local image is primary
2022-02-11T11:01:22.498 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:  service:     mirror.0 on smithi106
2022-02-11T11:01:22.498 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:  last_update: 2022-02-11 11:01:16
2022-02-11T11:01:22.499 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:  peer_sites:
2022-02-11T11:01:22.499 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:    name: cluster1
2022-02-11T11:01:22.499 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:    state: up+replaying
2022-02-11T11:01:22.499 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:    description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"remote_snapshot_timestamp":1644577159,"replay_state":"idle"}
2022-02-11T11:01:22.500 INFO:tasks.workunit.cluster1.client.mirror.smithi106.stdout:    last_update: 2022-02-11 11:01:16


Related issues 2 (0 open, 2 closed)

Copied to rbd - Backport #54381: pacific: qa/workunits/rbd/rbd_mirror_stress.sh: rbd: mirroring not enabled on the image (Resolved, Ponnuvel P)
Copied to rbd - Backport #54382: octopus: qa/workunits/rbd/rbd_mirror_stress.sh: rbd: mirroring not enabled on the image (Resolved, Ponnuvel P)
#1

Updated by Mykola Golub about 2 years ago

In the rbd-mirror log I see that both the local and remote status updaters send the status:

2022-02-11T11:01:11.623+0000 7f10aa7fd700 15 rbd::mirror::ImageReplayer: 0x55aeefd1c280 [3/c7006e3b-7e00-4cf0-bb62-e0bad69b2003] set_mirror_image_status_update: status={state=up+replaying, description=replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"remote_snapshot_timestamp":1644577159,"replay_state":"idle"}, last_update=0.000000]}
2022-02-11T11:01:11.623+0000 7f10aa7fd700 15 rbd::mirror::MirrorStatusUpdater 0x55aef09aac60 set_mirror_image_status: global_image_id=c7006e3b-7e00-4cf0-bb62-e0bad69b2003, mirror_image_site_status={state=up+replaying, description=replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"remote_snapshot_timestamp":1644577159,"replay_state":"idle"}, last_update=0.000000]}
2022-02-11T11:01:11.623+0000 7f10aa7fd700 15 rbd::mirror::MirrorStatusUpdater 0x55aeef116000 set_mirror_image_status: global_image_id=c7006e3b-7e00-4cf0-bb62-e0bad69b2003, mirror_image_site_status={state=up+replaying, description=replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"remote_snapshot_timestamp":1644577159,"replay_state":"idle"}, last_update=0.000000]}
2022-02-11T11:01:16.019+0000 7f10a6ff6700 10 rbd::mirror::MirrorStatusUpdater 0x55aef09aac60 handle_timer_task: 
2022-02-11T11:01:16.019+0000 7f10a6ff6700 10 rbd::mirror::MirrorStatusUpdater 0x55aef09aac60 schedule_timer_task: 
2022-02-11T11:01:16.019+0000 7f10a6ff6700 10 rbd::mirror::MirrorStatusUpdater 0x55aef09aac60 queue_update_task: 
2022-02-11T11:01:16.019+0000 7f10a9ffc700 10 rbd::mirror::MirrorStatusUpdater 0x55aef09aac60 update_task: 
2022-02-11T11:01:16.023+0000 7f10a37ef700 10 rbd::mirror::MirrorStatusUpdater 0x55aef09aac60 handle_update_task: 
2022-02-11T11:01:16.023+0000 7f10a6ff6700 10 rbd::mirror::MirrorStatusUpdater 0x55aeef116000 handle_timer_task: 
2022-02-11T11:01:16.023+0000 7f10a6ff6700 10 rbd::mirror::MirrorStatusUpdater 0x55aeef116000 schedule_timer_task: 
2022-02-11T11:01:16.023+0000 7f10a6ff6700 10 rbd::mirror::MirrorStatusUpdater 0x55aeef116000 queue_update_task: 
2022-02-11T11:01:16.023+0000 7f10aa7fd700 10 rbd::mirror::MirrorStatusUpdater 0x55aeef116000 update_task: 
2022-02-11T11:01:16.027+0000 7f109cfe2700 10 rbd::mirror::MirrorStatusUpdater 0x55aeef116000 handle_update_task: 

But the output of `rados --cluster cluster1 -p mirror --namespace listomapvals rbd_mirroring` shows that for some reason the status is not stored in the local cluster (cluster1).

#2

Updated by Mykola Golub about 2 years ago

  • Status changed from New to In Progress
  • Assignee set to Mykola Golub
#3

Updated by Mykola Golub about 2 years ago

The problem seems to be caused by a race when a peer is registered both by the rbd-mirror "status ping" and by the test setup script. After [1] was fixed the test setup always wins, but rbd-mirror may still use the in-memory (outdated) mirror peer uuid, so the replayer skips mirroring snapshots and does not update the image mirroring state to "ENABLED".

Below are details based on the case [2].

For snapshot-based mirroring, the "enable" request sets the image state to "enabling" and creates a mirroring snapshot. When the replayer runs, it is expected to detect the primary mirroring snapshot, create the corresponding non-primary snapshot, and update the mirroring state to "enabled" (see rbd::mirror::image_replayer::snapshot::Replayer::handle_update_mirror_image_state). But in the test case we see:

2022-02-12T19:05:17.776+0000 7f51bbee7700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x563ff1715400 handle_load_local_image_meta: r=-2
2022-02-12T19:05:17.776+0000 7f51bbee7700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x563ff1715400 scan_local_mirror_snapshots: 
2022-02-12T19:05:17.776+0000 7f51bbee7700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x563ff1715400 scan_remote_mirror_snapshots: 
2022-02-12T19:05:17.776+0000 7f51bbee7700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x563ff1715400 scan_remote_mirror_snapshots: remote mirror snapshot: id=4, mirror_ns=[mirror state=primary, complete=1, mirror_peer_uuids=b6da8fb9-9aa2-4977-be4e-1f70853a9f10, primary_mirror_uuid=, primary_snap_id=head, last_copied_object_number=0, snap_seqs={}]
2022-02-12T19:05:17.776+0000 7f51bbee7700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x563ff1715400 scan_remote_mirror_snapshots: skipping remote snapshot due to missing mirror peer
2022-02-12T19:05:17.776+0000 7f51bbee7700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x563ff1715400 scan_remote_mirror_snapshots: all remote snapshots synced: idling waiting for new snapshot
2022-02-12T19:05:17.776+0000 7f51bbee7700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x563ff1715400 notify_status_updated: 

I.e. the primary snapshot was skipped because the peer uuid b6da8fb9-9aa2-4977-be4e-1f70853a9f10 did not match the peer uuid that rbd-mirror thought was its own.
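
The skip decision boils down to a membership test: the snapshot's mirror_peer_uuids must contain the peer uuid that rbd-mirror currently holds. A simplified shell model of that check (the real logic is C++ in rbd::mirror::image_replayer::snapshot::Replayer::scan_remote_mirror_snapshots; the function below is illustrative only):

```shell
#!/usr/bin/env bash
# Simplified model of the replayer's peer-uuid check.
snapshot_matches_peer() {
    local snap_peer_uuids=$1   # comma-separated mirror_peer_uuids from the snapshot
    local local_peer_uuid=$2   # uuid rbd-mirror currently holds (possibly stale)
    case ",$snap_peer_uuids," in
        *",$local_peer_uuid,"*) return 0 ;;  # snapshot is for us: sync it
        *)                      return 1 ;;  # "missing mirror peer": skip it
    esac
}

# The failing case: the snapshot is tagged with the recreated peer's uuid,
# while rbd-mirror still holds the old (deleted) one, so the check fails.
```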

And if we look at the test log when it was configuring the peer, we can see:

2022-02-12T19:05:12.352 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ rbd --cluster cluster2 mirror pool peer add mirror cluster1
2022-02-12T19:05:12.385 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:rbd: mirror peer already exists
2022-02-12T19:05:12.388 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ peer_uuid=

2022-02-12T19:05:12.389 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ sleep 1
2022-02-12T19:05:13.391 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ rbd mirror pool info --cluster cluster2 --pool mirror --format xml
2022-02-12T19:05:13.391 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ xmlstarlet sel -t -v //peers/peer[site_name='cluster1']/uuid
2022-02-12T19:05:13.447 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ peer_uuid=d1d0f004-da17-4438-89d1-c516656e2382
2022-02-12T19:05:13.447 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ CEPH_ARGS= rbd --cluster cluster2 --pool mirror mirror pool peer remove d1d0f004-da17-4438-89d1-c516656e2382
2022-02-12T19:05:13.503 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ set +e
2022-02-12T19:05:13.503 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ rbd --cluster cluster2 mirror pool peer add mirror cluster1
2022-02-12T19:05:13.544 INFO:tasks.workunit.cluster1.client.mirror.smithi149.stderr:+ peer_uuid=b6da8fb9-9aa2-4977-be4e-1f70853a9f10

I.e. after enabling mirroring, when adding a peer, the test detected that the peer had already been created automatically by the rbd-mirror "status ping", and recreated it because it was tx-only (see [1] for a more detailed description of this case). After setting the peer, no notification is sent that would let rbd-mirror detect the metadata change, so it keeps using the old peer. Only after the test failed do we see that rbd-mirror eventually detected it:

2022-02-12T19:05:42.179 INFO:tasks.rbd_mirror.cluster1.client.mirror.0.smithi149.stderr:2022-02-12T19:05:42.172+0000 7f3d01513700 -1 rbd::mirror::PoolReplayer: 0x55eb555e4c00 handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly
2022-02-12T19:05:42.214 INFO:tasks.rbd_mirror.cluster1.client.mirror.0.smithi149.stderr:2022-02-12T19:05:42.208+0000 7f3d1ed91680 -1 rbd::mirror::Mirror: 0x55eb53155c20 update_pool_replayers: restarting failed pool replayer for uuid: e0c8605b-d07c-4463-bc90-8dfbf537c0b0 cluster: cluster2 client: client.mirror
2022-02-12T19:05:42.249 INFO:tasks.rbd_mirror.cluster1.client.mirror.1.smithi149.stderr:2022-02-12T19:05:42.244+0000 7f51b56da700 -1 rbd::mirror::PoolReplayer: 0x563fed9d3b00 handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly
2022-02-12T19:05:42.352 INFO:tasks.rbd_mirror.cluster1.client.mirror.1.smithi149.stderr:2022-02-12T19:05:42.348+0000 7f51a86c0700 -1 rbd::mirror::PoolReplayer: 0x563fefda0f00 handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly
2022-02-12T19:05:43.405 INFO:tasks.rbd_mirror.cluster1.client.mirror.2.smithi149.stderr:2022-02-12T19:05:43.400+0000 7f4b591ed700 -1 rbd::mirror::PoolReplayer: 0x563828f07800 handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly
2022-02-12T19:05:43.428 INFO:tasks.rbd_mirror.cluster1.client.mirror.3.smithi149.stderr:2022-02-12T19:05:43.424+0000 7f59250fd700 -1 rbd::mirror::PoolReplayer: 0x556b6ee3fb00 handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly
2022-02-12T19:05:43.514 INFO:tasks.rbd_mirror.cluster1.client.mirror.2.smithi149.stderr:2022-02-12T19:05:43.508+0000 7f4b4b1d1700 -1 rbd::mirror::PoolReplayer: 0x56382b357200 handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly
2022-02-12T19:05:43.549 INFO:tasks.rbd_mirror.cluster1.client.mirror.3.smithi149.stderr:2022-02-12T19:05:43.544+0000 7f59178e2700 -1 rbd::mirror::PoolReplayer: 0x556b71218f00 handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly

I suppose the problem was not seen previously because, before [3] was merged, we allowed setting the mirror image status even if the image was not in the "enabled" state.

[1] https://tracker.ceph.com/issues/44938
[2] http://qa-proxy.ceph.com/teuthology/trociny-2022-02-12_18:36:33-rbd-wip-mgolub-testing-distro-basic-smithi/6678662/
[3] https://github.com/ceph/ceph/pull/41696

#4

Updated by Mykola Golub about 2 years ago

Digging further, the problem is not that rbd-mirror detected the peer change too late. After detecting it ("handle_remote_pool_meta_updated: remote pool metadata updated unexpectedly") it had enough time to restart the pool replayer before the test aborted, but it did not:

When a PoolReplayer detects a remote pool metadata change it sets the "stopping" flag, expecting the Mirror to restart it. Although setting the "stopping" flag causes the PoolReplayer::run thread to terminate, the thread's is_started function still returns true until join is called (which resets the thread id). This made it impossible for the Mirror to detect (by calling PoolReplayer::is_running) that the PoolReplayer needed a restart.
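
The lifecycle bug can be modeled with a background job standing in for the PoolReplayer thread: the saved pid (playing the role of the stored thread id) stays set after the job exits, so an is_started-style check keeps reporting true until a join. This is a sketch only, not the actual ceph Thread class:

```shell
#!/usr/bin/env bash
# Model: thread_pid plays the role of Thread's stored thread id.
thread_pid=

start()      { sleep 0.2 & thread_pid=$!; }
is_started() { [ -n "$thread_pid" ]; }               # true until join() clears the id
join()       { wait "$thread_pid" 2>/dev/null; thread_pid=; }

start
sleep 0.5          # the background "run thread" has terminated by now...
is_started && echo "still reports started"           # ...yet is_started is still true
join
is_started || echo "stopped only after join"
```

This mirrors why Mirror::update_pool_replayers could not see that the PoolReplayer's run thread had already exited: nothing had called join yet, so the "started" state never flipped.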

#5

Updated by Mykola Golub about 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 45086
#6

Updated by Ilya Dryomov about 2 years ago

  • Backport set to octopus,pacific
#7

Updated by Ilya Dryomov about 2 years ago

  • Status changed from Fix Under Review to Pending Backport
#8

Updated by Backport Bot about 2 years ago

  • Copied to Backport #54381: pacific: qa/workunits/rbd/rbd_mirror_stress.sh: rbd: mirroring not enabled on the image added
#9

Updated by Backport Bot about 2 years ago

  • Copied to Backport #54382: octopus: qa/workunits/rbd/rbd_mirror_stress.sh: rbd: mirroring not enabled on the image added
#10

Updated by Ilya Dryomov almost 2 years ago

  • Status changed from Pending Backport to Resolved