Bug #56724

Updated by Ramana Raja about 1 year ago

Currently, the MirrorSnapshotScheduleHandler thread gets wedged, and as a result mirror snapshot scheduling is halted until the ceph-mgr daemon is restarted or failed over.

The MirrorSnapshotScheduleHandler thread gets wedged when its RADOS client gets blocklisted. The same would happen to the TaskHandler, PerfHandler and TrashPurgeScheduleHandler threads, since they all use the same RADOS client, the rbd_support module's client. To fix this issue, when the client gets blocklisted, the handlers and the client connection need to be shut down, and new handler threads using a new RADOS client connection need to be started.
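The recovery flow described above can be sketched roughly as follows. This is only an illustration of the pattern, not the rbd_support module's code: `FakeClient`, `Handler`, `Module`, and `BlocklistedError` are hypothetical stand-ins, and the real handlers run in their own threads rather than being driven by a `tick()` call.

```python
class BlocklistedError(Exception):
    """Hypothetical stand-in for the error seen by a blocklisted client."""


class FakeClient:
    """Hypothetical stand-in for the module's shared RADOS client."""

    def __init__(self):
        self.blocklisted = False
        self.open = True

    def do_io(self):
        if self.blocklisted:
            raise BlocklistedError("client is blocklisted")

    def shutdown(self):
        self.open = False


class Handler:
    """Hypothetical handler that does its work through the shared client."""

    def __init__(self, name, client):
        self.name = name
        self.client = client
        self.running = True

    def run_once(self):
        self.client.do_io()

    def shutdown(self):
        self.running = False


HANDLER_NAMES = (
    "MirrorSnapshotScheduleHandler",
    "TaskHandler",
    "PerfHandler",
    "TrashPurgeScheduleHandler",
)


class Module:
    def __init__(self):
        self._start()

    def _start(self):
        # All handlers share one client, so they must be rebuilt together.
        self.client = FakeClient()
        self.handlers = [Handler(n, self.client) for n in HANDLER_NAMES]

    def recover(self):
        # Shut down every handler and the old connection, then start over
        # with a fresh client and fresh handlers.
        for handler in self.handlers:
            handler.shutdown()
        self.client.shutdown()
        self._start()

    def tick(self):
        """Run each handler once; recover if the client was blocklisted."""
        try:
            for handler in self.handlers:
                handler.run_once()
            return True
        except BlocklistedError:
            self.recover()
            return False
```

The point of the sketch is that recovery replaces the client and all four handlers as a unit, because a handler left holding the old client would stay wedged.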

The feature tracked in https://tracker.ceph.com/issues/58691 stores the names of modules along with the client addresses that are registered in the MgrMap. Automated tests will use this to identify the rbd_support module's client, blocklist it, and check that the module recovers.
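As a rough illustration of how a test might pick out the module's client, the sketch below looks up addresses by module name. The dictionary layout (`clients`, `module`, `addr`) is a simplified, hypothetical shape chosen for the example and is not the actual MgrMap format; the addresses are made up.

```python
def find_module_client_addrs(mgr_map, module_name):
    """Return the client addresses registered under module_name."""
    return [
        client["addr"]
        for client in mgr_map.get("clients", [])
        if client.get("module") == module_name
    ]


# Hypothetical, simplified map contents for illustration only.
example_map = {
    "clients": [
        {"module": "rbd_support", "addr": "192.0.2.10:0/1111"},
        {"module": "volumes", "addr": "192.0.2.10:0/2222"},
    ]
}

# The address a test would blocklist before checking for recovery.
rbd_support_addrs = find_module_client_addrs(example_map, "rbd_support")
```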

Currently, when a mgr is failed over using the `mgr fail` command, the MgrMonitor proposes two changes: an OSDMap change that adds the registered clients of the mgr being failed to the blocklist, and a MgrMap change that sets a standby mgr as active in place of the mgr being failed. The mgr being failed may not immediately see the new MgrMap in which it is no longer active. However, its modules (e.g., rbd_support) may see their clients blocklisted and recover by registering new clients. Meanwhile, the standby mgr may see the new MgrMap before the mgr being failed does. Since the standby mgr sees itself as newly active, it starts loading its modules and registering clients. At that point, mgr module clients belonging to two different mgrs would be trying to modify the same resource. This situation is resolved only when the mgr being failed sees the new MgrMap in which it is no longer active and kills itself.

To prevent clients of two mgrs from racing, the plan is to block the registration of a recovering client of the mgr being failed until the client's address shows up in a new MgrMap. This forces the failed mgr to wait for the new MgrMap in which it is no longer active. The work to block the registration of a mgr module's client until the client's address shows up in the MgrMap is tracked at https://tracker.ceph.com/issues/58924.
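The gating idea can be sketched with a condition variable: registration blocks until a new map either contains the client's address or shows that this mgr is no longer active. This is a minimal sketch under stated assumptions, not the planned implementation; `MgrMapWatcher`, `on_new_map`, and `register_client` are hypothetical names.

```python
import threading


class MgrMapWatcher:
    """Hypothetical helper that gates client registration on MgrMap updates."""

    def __init__(self):
        self._cond = threading.Condition()
        self._client_addrs = set()
        self._active = True

    def on_new_map(self, client_addrs, still_active):
        # Called whenever this mgr receives a new MgrMap.
        with self._cond:
            self._client_addrs = set(client_addrs)
            self._active = still_active
            self._cond.notify_all()

    def register_client(self, addr, timeout=None):
        """Block until addr appears in a map or this mgr stops being active.

        Returns True only if the address was seen while the mgr is still
        active; returns False on timeout or if the mgr was failed.
        """
        with self._cond:
            self._cond.wait_for(
                lambda: addr in self._client_addrs or not self._active,
                timeout,
            )
            return self._active and addr in self._client_addrs
```

In this sketch, a mgr that has been failed never completes registration: the first map it receives marks it inactive, so `register_client` returns `False` and the mgr can shut down instead of competing with the newly active one.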
