Bug #63673

qa/workunits/rbd/cli_generic.sh: rbd support module command not failing as expected after module's client is blocklisted

Added by Ramana Raja 5 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee: Ramana Raja
Target version: -
% Done: 0%
Source: Q/A
Tags: backport_processed
Backport: pacific,quincy,reef
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: 54724
Crash signature (v1): -
Crash signature (v2): -

Description

Observed in https://pulpito.ceph.com/yuriw-2023-11-27_17:14:54-rbd-wip-yuri10-testing-2023-11-22-1112-distro-default-smithi/7468177/, `rbd task add flatten rbd2/clone1` succeeded when it was expected to fail after the rbd_support module's client was blocklisted. The failing test script line is https://github.com/ceph/ceph/blob/v18.2.0/qa/workunits/rbd/cli_generic.sh#L1666

From http://qa-proxy.ceph.com/teuthology/yuriw-2023-11-27_17:14:54-rbd-wip-yuri10-testing-2023-11-22-1112-distro-default-smithi/7468177/teuthology.log:

2023-11-27T22:15:34.519 INFO:tasks.workunit.client.0.smithi172.stderr:+ CLIENT_ADDR=172.21.15.172:0/3010898941
2023-11-27T22:15:34.519 INFO:tasks.workunit.client.0.smithi172.stderr:+ ceph osd blocklist add 172.21.15.172:0/3010898941
2023-11-27T22:15:34.802 INFO:tasks.workunit.client.0.smithi172.stderr:2023-11-27T22:15:34.801+0000 7fdad3fff640  0 --2- 172.21.15.172:0/3299696924 >> [v2:172.21.15.172:3300/0,v1:172.21.15.172:6789/0] conn(0x7fdad4098860 0x7fdad408f250 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).send_auth_request get_initial_auth_request returned -2
2023-11-27T22:15:36.941 INFO:tasks.workunit.client.0.smithi172.stderr:blocklisting 172.21.15.172:0/3010898941 until 2023-11-27T23:15:36.054767+0000 (3600 sec)
2023-11-27T22:15:36.955 INFO:tasks.workunit.client.0.smithi172.stderr:+ ceph osd blocklist ls
2023-11-27T22:15:36.955 INFO:tasks.workunit.client.0.smithi172.stderr:+ grep 172.21.15.172:0/3010898941
2023-11-27T22:15:37.341 INFO:tasks.workunit.client.0.smithi172.stderr:listed 10 entries
2023-11-27T22:15:37.352 INFO:tasks.workunit.client.0.smithi172.stdout:172.21.15.172:0/3010898941 2023-11-27T23:15:34.988549+0000
2023-11-27T22:15:37.353 INFO:tasks.workunit.client.0.smithi172.stderr:+ expect_fail ceph rbd task add flatten rbd2/clone1
2023-11-27T22:15:37.353 INFO:tasks.workunit.client.0.smithi172.stderr:+ ceph rbd task add flatten rbd2/clone1
2023-11-27T22:15:38.307 INFO:tasks.workunit.client.0.smithi172.stdout:{"sequence": 1, "id": "6aa3476c-68d9-4821-ad8b-cc5cf9bbb747", "message": "Flattening image rbd2/clone1", "refs": {"action": "flatten", "pool_name": "rbd2", "pool_namespace": "", "image_name": "clone1", "image_id": "33b18161126c"}, "in_progress": true, "progress": 0.1328125}
2023-11-27T22:15:38.334 INFO:tasks.workunit.client.0.smithi172.stderr:+ return 1
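
The trailing `+ return 1` is `expect_fail` flagging the unexpected success: `expect_fail` is the workunit helper that inverts the exit status of the command it wraps. Roughly (an illustrative sketch, not necessarily the exact definition in cli_generic.sh):

    # illustrative expect_fail-style helper: succeed only if the wrapped
    # command fails
    expect_fail() {
        if "$@"; then
            return 1    # command unexpectedly succeeded
        else
            return 0    # command failed, as expected
        fi
    }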

Looking at the ceph-mgr log, after the rbd_support module's client was blocklisted, the module started to recover:

2023-11-27T22:15:36.979+0000 7f809b2b4640  0 [rbd_support ERROR root] MirrorSnapshotScheduleHandler: client blocklisted
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/rbd_support/mirror_snapshot_schedule.py", line 358, in run
    refresh_delay = self.refresh_images()
  File "/usr/share/ceph/mgr/rbd_support/mirror_snapshot_schedule.py", line 397, in refresh_images
    self.load_schedules()
  File "/usr/share/ceph/mgr/rbd_support/mirror_snapshot_schedule.py", line 387, in load_schedules
    self.schedules.load(namespace_validator, image_validator)
  File "/usr/share/ceph/mgr/rbd_support/schedule.py", line 419, in load
    self.load_from_pool(ioctx, namespace_validator,
  File "/usr/share/ceph/mgr/rbd_support/schedule.py", line 442, in load_from_pool
    ioctx.operate_read_op(read_op, self.handler.SCHEDULE_OID)
  File "rados.pyx", line 3723, in rados.Ioctx.operate_read_op
rados.ConnectionShutdown: [errno 108] RADOS connection was shutdown (Failed to operate read op for oid rbd_mirror_snapshot_schedule)
2023-11-27T22:15:36.979+0000 7f809eafb640  0 [rbd_support INFO root] recovering from blocklisting
2023-11-27T22:15:36.979+0000 7f809eafb640  0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: shutting down
2023-11-27T22:15:36.979+0000 7f809eafb640  0 [rbd_support DEBUG root] CreateSnapshotRequests.wait_for_pending: done
2023-11-27T22:15:36.979+0000 7f809eafb640  0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: shut down
2023-11-27T22:15:36.979+0000 7f809eafb640  0 [rbd_support INFO root] TrashPurgeScheduleHandler: shutting down
2023-11-27T22:15:36.979+0000 7f809eafb640  0 [rbd_support DEBUG root] TrashPurgeScheduleHandler: joining thread
2023-11-27T22:15:36.983+0000 7f809eafb640  0 [rbd_support INFO root] TrashPurgeScheduleHandler: shut down
2023-11-27T22:15:36.983+0000 7f809eafb640  0 [rbd_support INFO root] TaskHandler: shutting down
2023-11-27T22:15:36.983+0000 7f809eafb640  0 [rbd_support DEBUG root] TaskHandler: joining thread
...
2023-11-27T22:15:37.193+0000 7f80992b0640  0 [rbd_support DEBUG root] TaskHandler: tick
2023-11-27T22:15:37.193+0000 7f809eafb640  0 [rbd_support INFO root] TaskHandler: shut down
2023-11-27T22:15:37.193+0000 7f809eafb640  0 [rbd_support INFO root] PerfHandler: shutting down
2023-11-27T22:15:37.193+0000 7f809eafb640  0 [rbd_support DEBUG root] PerfHandler: joining thread
...
2023-11-27T22:15:37.415+0000 7f809a2b2640  0 [rbd_support DEBUG root] PerfHandler: tick
2023-11-27T22:15:37.416+0000 7f809eafb640  0 [rbd_support INFO root] PerfHandler: shut down

The rbd_support module had already recovered before the `rbd task add flatten` command was received by ceph-mgr ("setup complete" at 22:15:37.429, while the command only reached the mgr at 22:15:37.641):

2023-11-27T22:15:37.429+0000 7f809eafb640  0 [rbd_support DEBUG root] TrashPurgeScheduleHandler: no schedules
2023-11-27T22:15:37.429+0000 7f809eafb640  0 [rbd_support DEBUG root] TrashPurgeScheduleHandler: queue is initialized
2023-11-27T22:15:37.429+0000 7f80952a8640  0 [rbd_support INFO root] TrashPurgeScheduleHandler: starting
2023-11-27T22:15:37.429+0000 7f809eafb640  0 [rbd_support INFO root] setup complete

2023-11-27T22:15:37.436+0000 7f810cfd8640  1 --2- [v2:172.21.15.172:6824/1773432538,v1:172.21.15.172:6825/1773432538] >> 172.21.15.172:0/1381705749 conn(0x557ab5ce5000 0x557ab69e4680 secure :-1 s=READY pgs=2 cs=0 l=1 rev1=1 crypto rx=0x557ab5adec90 tx=0x557ab6bc7740 comp rx=0 tx=0).ready entity=client.13247 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2023-11-27T22:15:37.641+0000 7f80afb5d640  1 -- [v2:172.21.15.172:6824/1773432538,v1:172.21.15.172:6825/1773432538] <== client.13247 172.21.15.172:0/1381705749 1 ==== mgr_command(tid 0: {"prefix": "rbd task add flatten", "image_spec": "rbd2/clone1", "target": ["mon-mgr", ""]}) v1 ==== 114+0+0 (secure 0 0 0) 0x557ab9c92dc0 con 0x557ab5ce5000
2023-11-27T22:15:37.641+0000 7f80afb5d640 10 mgr.server _handle_command decoded-size=3 prefix=rbd task add flatten
2023-11-27T22:15:37.641+0000 7f80afb5d640 20 is_capable service=py module=rbd_support command=rbd task add flatten write addr - on cap allow *
2023-11-27T22:15:37.641+0000 7f80afb5d640 20  allow so far , doing grant allow *
2023-11-27T22:15:37.641+0000 7f80afb5d640 20  allow all
2023-11-27T22:15:37.641+0000 7f80afb5d640 10 mgr.server _allowed_command  client.admin capable
2023-11-27T22:15:37.641+0000 7f80afb5d640  0 log_channel(audit) log [DBG] : from='client.13247 -' entity='client.admin' cmd=[{"prefix": "rbd task add flatten", "image_spec": "rbd2/clone1", "target": ["mon-mgr", ""]}]: dispatch
2023-11-27T22:15:37.641+0000 7f80afb5d640 10 mgr.server _handle_command passing through command 'rbd task add flatten' size 3
2023-11-27T22:15:37.641+0000 7f809daf9640 10 mgr.server operator() dispatching command 'rbd task add flatten' size 3

test_tasks_recovery() is racy.

    # Fetch and blocklist rbd_support module's RADOS client
    CLIENT_ADDR=$(ceph mgr dump | jq .active_clients[] |
        jq 'select(.name == "rbd_support")' |
        jq -r '[.addrvec[0].addr, "/", .addrvec[0].nonce|tostring] | add')
    ceph osd blocklist add $CLIENT_ADDR
    ceph osd blocklist ls | grep $CLIENT_ADDR

    expect_fail ceph rbd task add flatten rbd2/clone1
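
For reference, the jq pipeline composes the module client's address as `<addr>/<nonce>` (e.g. `172.21.15.172:0/3010898941` in the log above). An equivalent single jq invocation, shown only for clarity, would be:

    # illustrative single-jq equivalent of the extraction above
    CLIENT_ADDR=$(ceph mgr dump |
        jq -r '.active_clients[] | select(.name == "rbd_support") |
               .addrvec[0].addr + "/" + (.addrvec[0].nonce | tostring)')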

It took nearly 3 seconds for the `osd blocklist add` and `osd blocklist ls` commands to complete. Before the `rbd task add flatten` command reached the mgr, the rbd_support module had recovered from the blocklisting. Such a race may also occur in test_trash_purge_schedule_recovery(), test_mirror_snapshot_schedule_recovery(), and test_perf_image_iostat_recovery(). Maybe removing the `osd blocklist ls` command from the tests would help?
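
A minimal sketch of that suggestion (illustrative only, not the merged fix): keep the blocklisting but drop the `osd blocklist ls | grep` round trip, so that the `rbd task add` command reaches ceph-mgr before the module finishes recovering:

    # illustrative sketch of the suggested tweak, not the merged fix
    ceph osd blocklist add $CLIENT_ADDR
    # no `ceph osd blocklist ls | grep $CLIENT_ADDR` check here; issue the
    # command immediately so it races ahead of the module's recovery
    expect_fail ceph rbd task add flatten rbd2/clone1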


Related issues 3 (0 open, 3 closed)

Copied to rbd - Backport #63714: pacific: qa/workunits/rbd/cli_generic.sh: rbd support module command not failing as expected after module's client is blocklisted (Resolved, Ramana Raja)
Copied to rbd - Backport #63715: reef: qa/workunits/rbd/cli_generic.sh: rbd support module command not failing as expected after module's client is blocklisted (Resolved, Ramana Raja)
Copied to rbd - Backport #63716: quincy: qa/workunits/rbd/cli_generic.sh: rbd support module command not failing as expected after module's client is blocklisted (Resolved, Ramana Raja)
#1 Updated by Ramana Raja 5 months ago

  • Assignee set to Ramana Raja
  • Source set to Q/A
  • Backport set to pacific,quincy,reef
#2 Updated by Ilya Dryomov 5 months ago

  • Status changed from New to In Progress
#3 Updated by Ramana Raja 5 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 54724
#4 Updated by Ilya Dryomov 5 months ago

  • Status changed from Fix Under Review to Pending Backport
#5 Updated by Backport Bot 5 months ago

  • Copied to Backport #63714: pacific: qa/workunits/rbd/cli_generic.sh: rbd support module command not failing as expected after module's client is blocklisted added
#6 Updated by Backport Bot 5 months ago

  • Copied to Backport #63715: reef: qa/workunits/rbd/cli_generic.sh: rbd support module command not failing as expected after module's client is blocklisted added
#7 Updated by Backport Bot 5 months ago

  • Copied to Backport #63716: quincy: qa/workunits/rbd/cli_generic.sh: rbd support module command not failing as expected after module's client is blocklisted added
#8 Updated by Backport Bot 5 months ago

  • Tags set to backport_processed
#9 Updated by Backport Bot 3 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
