Bug #48052
Monitor commands randomly hang for "osd df", "osd perf" or "pg dump pgs_brief" in nautilus
Description
We use the official go-ceph binding to talk to Ceph and collect running metrics. We have found that sending "osd df", "osd perf", or "pg dump pgs_brief" in Nautilus via the Ceph monitor command interface randomly results in the command getting stuck. When that happens, the following can be observed in the monitor log, for example:
# grep 'slow ops' /var/log/ceph/ceph-mon.stage2-mon01-object01.log | tail
2020-10-28 16:00:39.437 7fa26bbdf700 -1 mon.stage2-mon01-object01@0(leader) e3 get_health_metrics reporting 43 slow ops, oldest is mon_command({"format":"json","prefix":"osd df"} v 0)
2020-10-28 16:00:44.445 7fa26bbdf700 -1 mon.stage2-mon01-object01@0(leader) e3 get_health_metrics reporting 43 slow ops, oldest is mon_command({"format":"json","prefix":"osd df"} v 0)
To clear those slow ops reports, we have to restart all Ceph managers.
We realized that "osd df", "osd perf", and "pg dump pgs_brief" are Ceph manager commands; however, in Luminous releases, sending them via the Ceph monitor command interface worked perfectly. So, we believe there might be a regression in the mgr/mon path that causes this hang.
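For reference, a minimal sketch of the call path being described (not the reporter's exact code): the command is issued through go-ceph's rados.Conn.MonCommand, so it goes to a monitor first, which forwards mgr-owned commands such as "osd df" to the active mgr. Connection setup assumes the default ceph.conf and keyring are readable by the client.

package main

import (
	"encoding/json"
	"fmt"

	"github.com/ceph/go-ceph/rados"
)

func main() {
	conn, err := rados.NewConn()
	if err != nil {
		panic(err)
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		panic(err)
	}
	if err := conn.Connect(); err != nil {
		panic(err)
	}
	defer conn.Shutdown()

	// Same payload as the stuck mon_command seen in the slow-ops log above.
	cmd, _ := json.Marshal(map[string]string{
		"prefix": "osd df",
		"format": "json",
	})

	// MonCommand sends the request to a monitor; mgr-owned commands are
	// then forwarded to the active mgr. This is the path that hangs here.
	buf, info, err := conn.MonCommand(cmd)
	if err != nil {
		panic(err)
	}
	fmt.Println(info)
	fmt.Println(string(buf))
}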
History
#1 Updated by Neha Ojha about 3 years ago
- Project changed from Ceph to RADOS
- Category deleted (Monitor)
#2 Updated by Neha Ojha about 3 years ago
- Project changed from RADOS to mgr
Have you checked the CPU utilization on the mgr when these commands hang? Which mgr modules are enabled? It is possible that these commands hang because the mgr is overloaded; we have seen other instances of this.
#3 Updated by Joshua Baergen about 3 years ago
CPU util on the nodes looks fine over the period (checked two systems).
Modules:
"always_on_modules": [ "balancer", "crash", "devicehealth", "orchestrator_cli", "progress", "rbd_support", "status", "volumes" ], "enabled_modules": [ "dashboard", "restful" ],
Note that we haven't had any issues since sending the commands directly to the mgrs rather than to the mons. Perhaps that implies that there's some sort of forwarding issue between the mgrs and mons? Also, IIRC during upgrades we saw this symptom start as soon as we enabled msgr2.
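A hedged sketch of that workaround, assuming a go-ceph version that provides rados.Conn.MgrCommand and an already-connected *rados.Conn as in the earlier example (osdDfViaMgr is a hypothetical helper name, not from this report):

package main

import (
	"encoding/json"
	"fmt"

	"github.com/ceph/go-ceph/rados"
)

// osdDfViaMgr issues "osd df" directly against the active mgr, bypassing
// the mon -> mgr forwarding path that appears to get stuck.
func osdDfViaMgr(conn *rados.Conn) (string, error) {
	cmd, err := json.Marshal(map[string]string{
		"prefix": "osd df",
		"format": "json",
	})
	if err != nil {
		return "", err
	}
	// MgrCommand takes a slice of command buffers and talks to the
	// active mgr directly instead of going through a monitor.
	buf, info, err := conn.MgrCommand([][]byte{cmd})
	if err != nil {
		return "", fmt.Errorf("mgr command failed: %w (%s)", err, info)
	}
	return string(buf), nil
}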
#4 Updated by Oleksandr Mykhalskyi 3 months ago
We see the same behavior in Pacific 16.2.12. It may be related to very slow processing of PG stats by the mgr daemon and an overloaded throttle-mgr_mon_messages queue; please see my comment in the related ticket https://tracker.ceph.com/issues/61925#note-6