Bug #56724

closed

[rbd_support] recover from RADOS instance blocklisting

Added by Ilya Dryomov over 1 year ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: pacific,quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Currently, the MirrorSnapshotScheduleHandler thread can get wedged, which halts mirror snapshot scheduling until the ceph-mgr daemon is restarted or failed over.

The MirrorSnapshotScheduleHandler thread gets wedged when its RADOS client gets blocklisted. The same would happen to the TaskHandler, PerfHandler and TrashPurgeScheduleHandler threads, since they share the rbd_support module's RADOS client. To fix this issue, when the client gets blocklisted, the handlers and the client connection need to be shut down, and new handler threads using a new RADOS client connection need to be started.
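The recovery sequence described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the rbd_support module's actual code: `FakeRadosClient`, `Handler`, `Module` and `recover_if_blocklisted` are hypothetical names, and blocklisting is simulated with a plain flag rather than detected through a real RADOS call.

```python
import itertools


class FakeRadosClient:
    """Hypothetical stand-in for the module's shared RADOS client."""

    _ids = itertools.count(1)

    def __init__(self):
        self.instance_id = next(self._ids)
        self.blocklisted = False  # simulated; set by the caller
        self.connected = True

    def shutdown(self):
        self.connected = False


class Handler:
    """Hypothetical stand-in for a handler thread owner."""

    def __init__(self, name, client):
        self.name = name
        self.client = client
        self.running = True

    def shutdown(self):
        self.running = False


class Module:
    HANDLER_NAMES = (
        "MirrorSnapshotScheduleHandler",
        "TaskHandler",
        "PerfHandler",
        "TrashPurgeScheduleHandler",
    )

    def __init__(self):
        self._setup()

    def _setup(self):
        # All handlers share one client, so one blocklisting affects them all.
        self.client = FakeRadosClient()
        self.handlers = [Handler(n, self.client) for n in self.HANDLER_NAMES]

    def recover_if_blocklisted(self):
        """Return True if a recovery was performed."""
        if not self.client.blocklisted:
            return False
        # Stop every handler first, then drop the blocklisted connection.
        for handler in self.handlers:
            handler.shutdown()
        self.client.shutdown()
        # Start fresh handlers on a new client connection.
        self._setup()
        return True
```

The point of the sketch is the ordering: handlers are stopped before the shared client is shut down, and every handler is recreated against the same new client instance.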

The feature tracked in https://tracker.ceph.com/issues/58691 stores the names of modules along with their clients' addresses in the MgrMap. It helps improve the general debuggability of the mgr service. It will also be used in automated tests to easily identify the rbd_support module's client to blocklist and to check for the module's recovery.
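As an illustration of how a test might pick out the rbd_support module's client from a MgrMap dump, here is a small sketch. The JSON layout it assumes (an `active_clients` list whose entries carry a `name` and an `addrvec` of `addr`/`nonce` pairs) is an assumption and should be checked against the actual MgrMap output; the helper name `find_module_client_addrs` and the sample data are made up.

```python
import json


def find_module_client_addrs(mgr_map, module_name):
    """Return "addr/nonce" strings for clients registered by module_name.

    Assumes the (unverified) layout described in the lead-in.
    """
    addrs = []
    for client in mgr_map.get("active_clients", []):
        if client.get("name") != module_name:
            continue
        for entry in client.get("addrvec", []):
            addrs.append(f"{entry['addr']}/{entry['nonce']}")
    return addrs


# Made-up sample in the assumed layout, not output from a real cluster.
SAMPLE = json.loads("""
{
  "active_clients": [
    {"name": "devicehealth",
     "addrvec": [{"type": "v2", "addr": "192.0.2.10:0", "nonce": 111}]},
    {"name": "rbd_support",
     "addrvec": [{"type": "v2", "addr": "192.0.2.10:0", "nonce": 222}]}
  ]
}
""")

print(find_module_client_addrs(SAMPLE, "rbd_support"))
```

A test could blocklist the returned address and then poll the MgrMap until a different address is listed for rbd_support, which would indicate that the module has recovered with a new client.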

When a mgr is failed over using the `mgr fail` command, the MgrMonitor proposes OSDMap changes that add the registered clients of the mgr being failed to the blocklist, and MgrMap changes that set a standby mgr as active in place of the mgr being failed. The mgr being failed may not yet have seen the new MgrMap in which it is no longer set as active. However, its modules (e.g., rbd_support) may see their clients blocklisted and recover by registering new clients. Meanwhile, the standby mgr may see the new MgrMap before the mgr being failed does. Since the standby mgr sees itself as newly active, it starts loading its modules and registering clients. At that point, mgr module clients belonging to two different mgrs would be trying to modify the same resource. This situation is resolved only when the mgr being failed sees the new MgrMap in which it is no longer active and kills itself.

To prevent the clients of the two mgrs from racing, the plan is to block the registration of the failed mgr's client during recovery until the client's address shows up in a new MgrMap. This forces the failed mgr to wait for the new MgrMap in which it is no longer active. The work to block the registration of a mgr module's client until the client's address shows up in the MgrMap is tracked in https://tracker.ceph.com/issues/58924.
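The idea of gating client registration on the MgrMap can be modeled with a condition variable. The following is only a sketch of the concept under simplifying assumptions, not the mgr's implementation: `MgrMapWatcher`, `handle_mgr_map` and this `register_client` signature are hypothetical, and the MgrMap is reduced to an "active" flag plus a set of client addresses.

```python
import threading


class MgrMapWatcher:
    """Hypothetical model of a mgr that gates registration on the MgrMap."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = True
        self._known_addrs = set()

    def handle_mgr_map(self, active, client_addrs):
        # Called whenever a new (simplified) MgrMap arrives.
        with self._cond:
            self._active = active
            self._known_addrs = set(client_addrs)
            self._cond.notify_all()

    def register_client(self, addr, timeout=None):
        """Block until addr appears in a MgrMap.

        Raises RuntimeError if a MgrMap shows this mgr is no longer active,
        and TimeoutError if the address never shows up in time.
        """
        with self._cond:
            self._cond.wait_for(
                lambda: not self._active or addr in self._known_addrs,
                timeout=timeout,
            )
            if not self._active:
                raise RuntimeError("mgr is no longer active")
            if addr not in self._known_addrs:
                raise TimeoutError("address did not appear in the MgrMap")
            return addr
```

In this model, a mgr that has been failed never completes the registration: the next MgrMap it receives marks it as inactive, so the waiting module gets an error instead of a usable new client.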

Recent work tracked by https://tracker.ceph.com/issues/58923 batches the MgrMonitor's proposal of OSDMap and MgrMap updates when dropping the mgr being failed. The batching helps reduce the delay between the blocklisting of the mgr's clients due to the OSDMap updates and the mgr killing itself on seeing the MgrMap updates in which it is no longer set as active.


Related issues 8 (2 open, 6 closed)

Related to mgr - Bug #58923: MgrMonitor: batch commit OSDMap and MgrMap mutations (Resolved, Patrick Donnelly)
Related to mgr - Bug #58924: mgr: block register_client on new MgrMap (Fix Under Review, Ramana Raja)
Related to mgr - Bug #58691: store names of modules that register RADOS clients in the MgrMap (Resolved, Ramana Raja)
Related to rbd - Bug #59681: [rbd_support] improve cli_generic.sh tests for recovery from blocklisting (New, Ramana Raja)
Related to rbd - Bug #59713: [rbd_support] recover from "double blocklisting" (being blocklisted while recovering from blocklisting) (Resolved, Ramana Raja)
Related to rbd - Bug #62994: mgr/rbd_support: recovery from client blocklisting halts after MirrorSnapshotScheduleHandler tries to terminate its run thread (Resolved, Ramana Raja)
Copied to rbd - Backport #59711: quincy: [rbd_support] recover from RADOS instance blocklisting (Resolved, Ramana Raja)
Copied to rbd - Backport #59712: pacific: [rbd_support] recover from RADOS instance blocklisting (Resolved, Ramana Raja)

Actions #1

Updated by Ilya Dryomov over 1 year ago

  • Status changed from In Progress to New
  • Assignee deleted (Ilya Dryomov)
Actions #2

Updated by Ramana Raja over 1 year ago

  • Assignee set to Ramana Raja
Actions #3

Updated by Ramana Raja over 1 year ago

  • Status changed from New to In Progress
Actions #4

Updated by Ramana Raja over 1 year ago

  • Pull request ID set to 49742
Actions #5

Updated by Ramana Raja about 1 year ago

  • Status changed from In Progress to Fix Under Review
Actions #6

Updated by Ramana Raja about 1 year ago

  • Related to Bug #58923: MgrMonitor: batch commit OSDMap and MgrMap mutations added
Actions #7

Updated by Ramana Raja about 1 year ago

  • Related to Bug #58924: mgr: block register_client on new MgrMap added
Actions #8

Updated by Ilya Dryomov about 1 year ago

  • Related to Bug #58691: store names of modules that register RADOS clients in the MgrMap added
Actions #9

Updated by Ramana Raja 12 months ago

  • Description updated (diff)
Actions #10

Updated by Ramana Raja 12 months ago

  • Description updated (diff)
Actions #11

Updated by Ramana Raja 12 months ago

  • Description updated (diff)
Actions #12

Updated by Ilya Dryomov 12 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to pacific,quincy
Actions #13

Updated by Backport Bot 12 months ago

  • Copied to Backport #59711: quincy: [rbd_support] recover from RADOS instance blocklisting added
Actions #14

Updated by Backport Bot 12 months ago

  • Copied to Backport #59712: pacific: [rbd_support] recover from RADOS instance blocklisting added
Actions #15

Updated by Backport Bot 12 months ago

  • Tags set to backport_processed
Actions #16

Updated by Ilya Dryomov 12 months ago

  • Related to Bug #59681: [rbd_support] improve cli_generic.sh tests for recovery from blocklisting added
Actions #17

Updated by Ilya Dryomov 12 months ago

  • Related to Bug #59713: [rbd_support] recover from "double blocklisting" (being blocklisted while recovering from blocklisting) added
Actions #18

Updated by Ramana Raja 7 months ago

  • Related to Bug #62994: mgr/rbd_support: recovery from client blocklisting halts after MirrorSnapshotScheduleHandler tries to terminate its run thread added
Actions #19

Updated by Ilya Dryomov 6 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
