Fix #51177

pybind/mgr/volumes: investigate moving calls which may block on libcephfs into another thread

Added by Patrick Donnelly over 1 year ago. Updated 2 months ago.

Status: Fix Under Review
Priority: Urgent
Category: -
Target version:
% Done: 0%
Source: Development
Tags:
Backport: pacific,quincy
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): mgr/volumes
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The goal is to avoid blocking the ceph-mgr finisher thread on any call out to CephFS. A blocked finisher thread can have disastrous consequences, because the mgr then stops functioning.

(I think it may need changes to ceph-mgr to support this. Namely, the core ceph-mgr code handles the return from the command and creates a reply message. We need that reply message to be constructed within the module and sent out.)

History

#1 Updated by Patrick Donnelly 5 months ago

  • Target version deleted (v17.0.0)

#3 Updated by Venky Shankar 4 months ago

  • Assignee set to Kotresh Hiremath Ravishankar
  • Target version set to v18.0.0
  • Backport changed from pacific to pacific,quincy

Kotresh, please take a look at this.

#4 Updated by Venky Shankar 3 months ago

Spoke to Kotresh today. We may want to introduce an async command execution interface in plugins: the finisher thread would call it and hand over the request, and the plugin itself would send the reply. Plugins can choose to implement this execution mode or keep the default blocking call made by the finisher thread.
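A minimal sketch of the handover idea, using only the Python standard library. All names here (`BlockingPlugin`, `AsyncPlugin`, `handle_command_async`, `dispatch`, the `Reply` tuple) are hypothetical and are not the ceph-mgr API; the point is only to show the dispatcher returning immediately when a plugin opts in.

```python
import threading
from typing import Callable, Tuple

# Hypothetical reply shape: (return code, stdout, stderr).
Reply = Tuple[int, str, str]


class BlockingPlugin:
    """Default mode: the calling thread runs the command to completion."""

    def handle_command(self, cmd: dict) -> Reply:
        return 0, f"done: {cmd['prefix']}", ""


class AsyncPlugin(BlockingPlugin):
    """Opt-in mode: the plugin takes over the request and replies itself."""

    def handle_command_async(self, cmd: dict,
                             reply: Callable[[Reply], None]) -> None:
        # Run the possibly blocking work on a plugin-owned thread
        # and return to the caller right away.
        def work() -> None:
            reply(self.handle_command(cmd))

        threading.Thread(target=work, daemon=True).start()


def dispatch(plugin: BlockingPlugin, cmd: dict,
             reply: Callable[[Reply], None]) -> None:
    """Stand-in for the finisher thread's dispatch step (assumed design)."""
    handler = getattr(plugin, "handle_command_async", None)
    if handler is not None:
        handler(cmd, reply)                 # handed over; plugin replies later
    else:
        reply(plugin.handle_command(cmd))   # blocks the calling thread
```

In this sketch a plugin that does not define `handle_command_async` keeps today's blocking behavior, which matches the "plugins can choose" part of the proposal.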

#5 Updated by Kotresh Hiremath Ravishankar 3 months ago

  • Pull request ID set to 47893

Discussion Summary with Patrick

1. Have a thread for each module to execute module commands. Since the finisher thread infrastructure is already in place, it's better to use one finisher thread per module. With the current draft PR, there is one finisher thread per module plus one generic finisher thread that handles everything else, such as config and notify. This differs from the approach in comment 4. Both have their pros and cons. With this approach, if a command gets stuck in a Python module, only that module is affected (its subsequent commands wait) and other modules' commands go through. This is comparatively easy to implement. The approach in comment 4 needs changes in every Python module, and the effects of asynchronous execution would have to be tested, since all module commands would become asynchronous.

2. Add a warning if the finisher thread's queue is growing or if a single command takes more than 15 seconds.

3. Add extensive performance counters for osdmap and mdsmap changes and for commands processed (op time, command count).
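The three points above can be sketched together in plain Python. This is an illustration only, not the code in the PR: `ModuleFinisher`, `submit`, `drain`, the counter names, and the `max_depth` threshold are all assumptions. Only the 15-second limit comes from the discussion.

```python
import logging
import queue
import threading
import time
from typing import Callable, Optional

log = logging.getLogger("mgr-sketch")


class ModuleFinisher:
    """One worker thread and one FIFO queue per module (hypothetical)."""

    def __init__(self, name: str, slow_secs: float = 15.0,
                 max_depth: int = 8) -> None:
        self.name = name
        self.slow_secs = slow_secs    # 15 s is the value from the discussion
        self.max_depth = max_depth    # assumed threshold for "queue is growing"
        self.counters = {"command_count": 0, "op_time": 0.0,
                         "slow_commands": 0}
        self._q: "queue.Queue[Optional[Callable[[], None]]]" = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, fn: Callable[[], None]) -> None:
        """Queue a command; warn when the backlog reaches the threshold."""
        depth = self._q.qsize()
        if depth >= self.max_depth:
            log.warning("%s: finisher queue depth is %d", self.name, depth)
        self._q.put(fn)

    def _run(self) -> None:
        while True:
            fn = self._q.get()
            if fn is None:            # stop sentinel
                self._q.task_done()
                return
            start = time.monotonic()
            try:
                fn()
            except Exception:
                log.exception("%s: command failed", self.name)
            finally:
                elapsed = time.monotonic() - start
                self.counters["command_count"] += 1
                self.counters["op_time"] += elapsed
                if elapsed > self.slow_secs:
                    self.counters["slow_commands"] += 1
                    log.warning("%s: command took %.1fs", self.name, elapsed)
                self._q.task_done()

    def drain(self) -> None:
        """Block until every queued command has finished."""
        self._q.join()

    def stop(self) -> None:
        self._q.put(None)
        self._thread.join()
```

With one `ModuleFinisher` per module, a command stuck in one module delays only that module's queue. Note that this sketch reports a slow command only after it returns; a command that never returns would need a separate watchdog, which is not shown here.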

#6 Updated by Kotresh Hiremath Ravishankar 3 months ago

  • Status changed from New to In Progress

#7 Updated by Venky Shankar 2 months ago

  • Status changed from In Progress to Fix Under Review
