Bug #62994 (closed)

mgr/rbd_support: recovery from client blocklisting halts after MirrorSnapshotScheduleHandler tries to terminate its run thread

Added by Ramana Raja 8 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee:
Target version: -
% Done: 0%
Source: Q/A
Tags: backport_processed
Backport: pacific,quincy,reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ran the integration test from https://github.com/ceph/ceph/pull/53535, which repeatedly blocklists the rbd_support module's RADOS client approximately every 10 seconds after the module recovers from the previous blocklisting. The run is at http://pulpito.front.sepia.ceph.com/rraja-2023-09-23_06:37:41-rbd:cli-wip-62891-distro-default-smithi/ . Observed 2 job failures:
- http://pulpito.front.sepia.ceph.com/rraja-2023-09-23_06:37:41-rbd:cli-wip-62891-distro-default-smithi/7401648/
- http://pulpito.front.sepia.ceph.com/rraja-2023-09-23_06:37:41-rbd:cli-wip-62891-distro-default-smithi/7401660/
where the rbd_support module didn't recover from blocklisting due to the following issue: the module's MirrorSnapshotScheduleHandler got stuck waiting for its run thread to terminate in its shutdown() method.
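For illustration, here is a minimal sketch of the shutdown pattern referred to above: a handler whose background run thread is stopped by setting a flag, notifying a condition variable, and joining the thread. The class and attribute names are illustrative assumptions, not the rbd_support module's actual code; the point is only that if the run thread is blocked in work that never returns, shutdown()'s join() never returns either.

import threading

class ExampleScheduleHandler:
    """Illustrative handler with a background run thread (hypothetical names)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.condition = threading.Condition(self.lock)
        self.stop_thread = False
        self.thread = threading.Thread(target=self.run)
        self.thread.start()

    def run(self):
        while True:
            with self.lock:
                if self.stop_thread:
                    break
                # Wake up periodically or when shutdown() notifies us.
                self.condition.wait(timeout=60)
            # Placeholder for the handler's periodic work (e.g. refreshing
            # schedules from RADOS).  If this call blocks and never returns
            # after the client is blocklisted, the loop never re-checks
            # stop_thread.
            self.refresh()

    def refresh(self):
        pass

    def shutdown(self):
        with self.lock:
            self.stop_thread = True
            self.condition.notify()
        # This corresponds to the "joining thread" point in the mgr log
        # excerpt below; if run() is stuck, this join() blocks forever and
        # the module's recovery from blocklisting stalls here.
        self.thread.join()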
Excerpt from the mgr log at /a/rraja-2023-09-23_06:37:41-rbd:cli-wip-62891-distro-default-smithi/7401648/remote/smithi099/log/ceph-mgr.x.log.gz in teuthology:

2023-09-23T07:18:32.518+0000 7fe0ea9fe640  0 [rbd_support ERROR root] TrashPurgeScheduleHandler: client blocklisted
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/rbd_support/trash_purge_schedule.py", line 46, in run
    refresh_delay = self.refresh_pools()
  File "/usr/share/ceph/mgr/rbd_support/trash_purge_schedule.py", line 95, in refresh_pools
    self.load_schedules()
  File "/usr/share/ceph/mgr/rbd_support/trash_purge_schedule.py", line 85, in load_schedules
    self.schedules.load()
  File "/usr/share/ceph/mgr/rbd_support/schedule.py", line 419, in load
    self.load_from_pool(ioctx, namespace_validator,
  File "/usr/share/ceph/mgr/rbd_support/schedule.py", line 442, in load_from_pool
    ioctx.operate_read_op(read_op, self.handler.SCHEDULE_OID)
  File "rados.pyx", line 3723, in rados.Ioctx.operate_read_op
rados.ConnectionShutdown: [errno 108] RADOS connection was shutdown (Failed to operate read op for oid rbd_trash_purge_schedule)
2023-09-23T07:18:32.518+0000 7fe0efa08640  0 [rbd_support INFO root] recovering from blocklisting
2023-09-23T07:18:32.518+0000 7fe0efa08640  0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: shutting down
2023-09-23T07:18:32.522+0000 7fe0efa08640  0 [rbd_support DEBUG root] MirrorSnapshotScheduleHandler: joining thread

After this, I don't see any further logs from MirrorSnapshotScheduleHandler or TrashPurgeScheduleHandler; I only see ticks from PerfHandler and TaskHandler.
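As an aside, one generic way to confirm where the stuck run thread is blocked (an assumption on my part; neither the test nor the module does this today) would be to dump the Python stacks of all threads in the process, e.g. with faulthandler. To be useful here it would have to run inside the ceph-mgr process, for example from a debug hook:

import faulthandler
import sys
import threading

def dump_all_thread_stacks(out=sys.stderr):
    # Print thread names and idents so the stack dump can be matched to the
    # MirrorSnapshotScheduleHandler run thread.
    for t in threading.enumerate():
        print(f"thread {t.ident}: {t.name} (daemon={t.daemon})", file=out)
    # Write a traceback for every running thread to 'out'.
    faulthandler.dump_traceback(file=out, all_threads=True)

if __name__ == "__main__":
    dump_all_thread_stacks()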


Related issues (5 total: 0 open, 5 closed)

Related to rbd - Bug #56724: [rbd_support] recover from RADOS instance blocklisting (Resolved, Ramana Raja)

Related to rbd - Bug #62891: [test][rbd] test recovery of rbd_support module from repeated blocklisting of its client (Resolved, Ramana Raja)

Copied to rbd - Backport #63382: pacific: mgr/rbd_support: recovery from client blocklisting halts after MirrorSnapshotScheduleHandler tries to terminate its run thread (Resolved, Ramana Raja)
Copied to rbd - Backport #63383: quincy: mgr/rbd_support: recovery from client blocklisting halts after MirrorSnapshotScheduleHandler tries to terminate its run thread (Resolved, Ramana Raja)
Copied to rbd - Backport #63384: reef: mgr/rbd_support: recovery from client blocklisting halts after MirrorSnapshotScheduleHandler tries to terminate its run thread (Resolved, Ramana Raja)
