Tags: backport_processed

Backport: pacific,quincy

Severity: 3 - minor

Pull request ID: 49742

Description

Currently, the MirrorSnapshotScheduleHandler thread gets wedged and, as a result, mirror snapshot scheduling is halted until the ceph-mgr daemon is restarted or failed over.

The MirrorSnapshotScheduleHandler thread gets wedged when its RADOS client gets blocklisted. The same would happen to the TaskHandler, PerfHandler and TrashPurgeScheduleHandler threads, since they share the rbd_support module's RADOS client. To fix this issue, when the client gets blocklisted, the handlers and the client connection need to be shut down, and new handler threads using a new RADOS client connection need to be started.
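The recovery flow described above can be sketched as a small simulation. This is only an illustration under stated assumptions: `FakeRadosClient`, `ClientBlocklisted`, `Handler` and `Module` are hypothetical stand-ins and not the rbd_support module's actual classes or the librados API.

```python
class ClientBlocklisted(Exception):
    """Hypothetical error raised when a blocklisted client is used."""


class FakeRadosClient:
    """Hypothetical stand-in for the RADOS client shared by all handlers."""

    _next_id = 0

    def __init__(self):
        FakeRadosClient._next_id += 1
        self.client_id = FakeRadosClient._next_id
        self.blocklisted = False
        self.connected = True

    def operate(self):
        # Any operation on a blocklisted client fails.
        if self.blocklisted:
            raise ClientBlocklisted(f"client {self.client_id} is blocklisted")

    def shutdown(self):
        self.connected = False


class Handler:
    """Hypothetical handler (for example a scheduler) bound to one client."""

    def __init__(self, name, client):
        self.name = name
        self.client = client
        self.running = True

    def run_once(self):
        self.client.operate()

    def shutdown(self):
        self.running = False


class Module:
    """Hypothetical module owning one client and several handlers."""

    HANDLER_NAMES = ("mirror_snapshot_schedule", "perf", "task",
                     "trash_purge_schedule")

    def __init__(self):
        self._setup()

    def _setup(self):
        # All handlers share a single client connection.
        self.client = FakeRadosClient()
        self.handlers = [Handler(n, self.client) for n in self.HANDLER_NAMES]

    def tick(self):
        try:
            for handler in self.handlers:
                handler.run_once()
        except ClientBlocklisted:
            self.recover()

    def recover(self):
        # Shut down the handlers and the blocklisted client, then start
        # fresh handlers on a new client connection.
        for handler in self.handlers:
            handler.shutdown()
        self.client.shutdown()
        self._setup()
```

In this sketch, a single blocklisting error observed by any handler triggers recovery for all of them, since they share the same client.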

The feature tracked in https://tracker.ceph.com/issues/58691 stores the names of modules along with their clients' addresses in the MgrMap. It helps improve the general debuggability of the mgr service. It will also be used in automated tests to easily identify the rbd_support module's client to blocklist and to check for the module's recovery.
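As a rough illustration of how a test could look up the module's client, the sketch below searches a parsed MgrMap dump for a module name. The field names (`active_clients`, `name`, `addrvec`, `addr`, `nonce`), the `addr/nonce` output form and the sample values are assumptions made for this example, not something stated in this tracker; `find_module_client` is a hypothetical helper.

```python
def find_module_client(mgr_map, module_name):
    """Return "addr/nonce" for the named module's client, or None.

    Assumes (hypothetically) that the MgrMap dump is a dict with an
    "active_clients" list whose entries carry "name" and "addrvec".
    """
    for client in mgr_map.get("active_clients", []):
        if client.get("name") != module_name:
            continue
        addrvec = client.get("addrvec") or []
        if addrvec:
            first = addrvec[0]
            return f"{first['addr']}/{first['nonce']}"
    return None


# Hypothetical sample data, for illustration only.
sample_mgr_map = {
    "active_clients": [
        {"name": "devicehealth",
         "addrvec": [{"type": "v2", "addr": "192.0.2.10:0", "nonce": 111}]},
        {"name": "rbd_support",
         "addrvec": [{"type": "v2", "addr": "192.0.2.10:0", "nonce": 222}]},
    ]
}

rbd_support_client = find_module_client(sample_mgr_map, "rbd_support")
```

A test could then blocklist the returned address and poll the MgrMap until a different address is registered under the same module name, which would indicate that the module has recovered.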

When a mgr is failed over using the `mgr fail` command, the MgrMonitor proposes OSDMap changes that add the registered clients of the mgr being failed to the blocklist, and proposes MgrMap changes that set a standby mgr as active in place of the mgr being failed. The mgr being failed may not yet have seen the new MgrMap in which it is no longer active. However, its modules (e.g., rbd_support) may see their clients blocklisted and recover by registering new clients. Meanwhile, the standby mgr may see the new MgrMap before the mgr being failed does. Since the standby mgr sees itself as newly active, it starts loading its modules and registering clients. At this point, mgr module clients belonging to two different mgrs would be trying to modify the same resource. This situation is resolved only when the mgr being failed sees the new MgrMap in which it is no longer active and kills itself.

To prevent clients of the two mgrs from racing, the plan is to block the registration of a client of the mgr being failed during recovery until the client's address shows up in a new MgrMap. This forces the failed mgr to wait for the new MgrMap in which it is no longer active. The work to block the registration of a mgr module's client until the client's address shows up in the MgrMap is tracked in https://tracker.ceph.com/issues/58924.
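The idea of blocking client registration until the client's address appears in a new MgrMap can be sketched with a condition variable. This is a minimal illustration, not the mgr's implementation: `MgrMapGate`, `on_new_mgr_map` and the `register_client` signature shown here are hypothetical.

```python
import threading


class MgrMapGate:
    """Hypothetical gate that holds registration until a MgrMap lists the address."""

    def __init__(self):
        self._cond = threading.Condition()
        self._known_addrs = set()

    def on_new_mgr_map(self, client_addrs):
        # Called whenever a new MgrMap is received; wakes up any waiters.
        with self._cond:
            self._known_addrs = set(client_addrs)
            self._cond.notify_all()

    def register_client(self, addr, timeout=None):
        # Block until a MgrMap containing `addr` has been seen.
        # Returns True on success and False if the timeout expires first.
        with self._cond:
            return self._cond.wait_for(
                lambda: addr in self._known_addrs, timeout=timeout)
```

In this sketch, a mgr that is being failed would stay blocked in `register_client` until it receives a new MgrMap. If that map shows the mgr as no longer active, the mgr would shut down instead of continuing with a freshly registered client.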

Recent work tracked by https://tracker.ceph.com/issues/58923 batches the MgrMonitor's proposal of OSDMap and MgrMap updates when dropping the mgr being failed. The batching helps reduce the delay between the blocklisting of the mgr's clients due to the OSDMap updates and the mgr killing itself on seeing the MgrMap updates in which it's no longer set as active.

Related issues 8 (2 open — 6 closed)

Related to mgr - Bug #58923: MgrMonitor: batch commit OSDMap and MgrMap mutations (Resolved, Patrick Donnelly)
Related to mgr - Bug #58924: mgr: block register_client on new MgrMap (Fix Under Review)
Related to mgr - Bug #58691: store names of modules that register RADOS clients in the MgrMap (Resolved)
Related to rbd - Bug #59681: [rbd_support] improve cli_generic.sh tests for recovery from blocklisting (New)
Related to rbd - Bug #59713: [rbd_support] recover from "double blocklisting" (being blocklisted while recovering from blocklisting) (Resolved)
Related to rbd - Bug #62994: mgr/rbd_support: recovery from client blocklisting halts after MirrorSnapshotScheduleHandler tries to terminate its run thread (Resolved)
Copied to rbd - Backport #59711: quincy: [rbd_support] recover from RADOS instance blocklisting (Resolved)
Copied to rbd - Backport #59712: pacific: [rbd_support] recover from RADOS instance blocklisting (Resolved)

Updated by Ilya Dryomov over 1 year ago

Status changed from In Progress to New
Assignee deleted (~~Ilya Dryomov~~)

Updated by Ramana Raja over 1 year ago

Assignee set to Ramana Raja

Updated by Ramana Raja over 1 year ago

Status changed from New to In Progress

Updated by Ramana Raja over 1 year ago

Pull request ID set to 49742

Updated by Ramana Raja about 1 year ago

Status changed from In Progress to Fix Under Review

This tracker depends on https://tracker.ceph.com/issues/58923 and https://tracker.ceph.com/issues/58924.

Updated by Ramana Raja about 1 year ago

Related to Bug #58923: MgrMonitor: batch commit OSDMap and MgrMap mutations added

Updated by Ramana Raja about 1 year ago

Related to Bug #58924: mgr: block register_client on new MgrMap added

Updated by Ilya Dryomov about 1 year ago

Related to Bug #58691: store names of modules that register RADOS clients in the MgrMap added

Updated by Ramana Raja 12 months ago

Description updated (diff)

#10

Updated by Ramana Raja 12 months ago

Description updated (diff)

#11

Updated by Ramana Raja 12 months ago

Description updated (diff)

#12

Updated by Ilya Dryomov 12 months ago

Status changed from Fix Under Review to Pending Backport
Backport set to pacific,quincy

#13

Updated by Backport Bot 12 months ago

Copied to Backport #59711: quincy: [rbd_support] recover from RADOS instance blocklisting added

#14

Updated by Backport Bot 12 months ago

Copied to Backport #59712: pacific: [rbd_support] recover from RADOS instance blocklisting added

#15

Updated by Backport Bot 12 months ago

Tags set to backport_processed

#16

Updated by Ilya Dryomov 12 months ago

Related to Bug #59681: [rbd_support] improve cli_generic.sh tests for recovery from blocklisting added

#17

Updated by Ilya Dryomov 12 months ago

Related to Bug #59713: [rbd_support] recover from "double blocklisting" (being blocklisted while recovering from blocklisting) added

#18

Updated by Ramana Raja 7 months ago

Related to Bug #62994: mgr/rbd_support: recovery from client blocklisting halts after MirrorSnapshotScheduleHandler tries to terminate its run thread added
