Bug #48052

open

Monitor commands randomly hang for "osd df", "osd perf" or "pg dump pgs_brief" in nautilus

Added by Yue Zhu over 3 years ago. Updated 8 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We use the official go-ceph binding to talk to Ceph and collect runtime metrics. We have found that sending osd df, osd perf, or pg dump pgs_brief to a Nautilus cluster via Ceph monitor commands randomly results in the command hanging. When this happens, the following can be observed in the monitor log, for example:

# grep 'slow ops' /var/log/ceph/ceph-mon.stage2-mon01-object01.log | tail
2020-10-28 16:00:39.437 7fa26bbdf700 -1 mon.stage2-mon01-object01@0(leader) e3 get_health_metrics reporting 43 slow ops, oldest is mon_command({"format":"json","prefix":"osd df"} v 0)
2020-10-28 16:00:44.445 7fa26bbdf700 -1 mon.stage2-mon01-object01@0(leader) e3 get_health_metrics reporting 43 slow ops, oldest is mon_command({"format":"json","prefix":"osd df"} v 0)

To clear the slow ops reports, we have to restart all Ceph managers.

We realized that osd df, osd perf, and pg dump pgs_brief are Ceph manager commands; however, in Luminous releases, sending them via Ceph monitor commands worked perfectly. So we believe there may be a regression in the mgr/mon path that causes this hang.
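
For reference, here is a minimal sketch of roughly how such a command is issued through go-ceph's MonCommand; the connection setup and error handling below are illustrative, not our exact production code:

package main

import (
	"fmt"
	"log"

	"github.com/ceph/go-ceph/rados"
)

func main() {
	// Connect using the default ceph.conf and keyring on the host.
	conn, err := rados.NewConn()
	if err != nil {
		log.Fatal(err)
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		log.Fatal(err)
	}
	if err := conn.Connect(); err != nil {
		log.Fatal(err)
	}
	defer conn.Shutdown()

	// Same JSON payload as seen in the slow-ops log above.
	cmd := []byte(`{"prefix":"osd df","format":"json"}`)
	buf, info, err := conn.MonCommand(cmd)
	if err != nil {
		log.Fatalf("mon command failed: %v (info: %s)", err, info)
	}
	fmt.Println(string(buf))
}

With the hang, the MonCommand call above simply never returns until the managers are restarted.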

Actions #1

Updated by Neha Ojha over 3 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (Monitor)
Actions #2

Updated by Neha Ojha over 3 years ago

  • Project changed from RADOS to mgr

Have you checked the CPU utilization on the mgr when these commands hang? Which mgr modules are enabled? It is possible that these commands hang because the mgr is overloaded; we have seen other instances of this.

Actions #3

Updated by Joshua Baergen over 3 years ago

CPU util on the nodes looks fine over the period (checked two systems).

Modules:

    "always_on_modules": [
        "balancer",
        "crash",
        "devicehealth",
        "orchestrator_cli",
        "progress",
        "rbd_support",
        "status",
        "volumes" 
    ],
    "enabled_modules": [
        "dashboard",
        "restful" 
    ],

Note that we haven't had any issues since we started sending the commands directly to the mgrs rather than to the mons. Perhaps that implies there's some sort of forwarding issue between the mgrs and mons? Also, IIRC, during upgrades we saw this symptom start as soon as we enabled msgr2.
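
For reference, a sketch of sending the same command directly to the mgr, assuming a go-ceph release new enough to expose Conn.MgrCommand; this is illustrative rather than the exact code in use:

package main

import (
	"fmt"
	"log"

	"github.com/ceph/go-ceph/rados"
)

func main() {
	conn, err := rados.NewConn()
	if err != nil {
		log.Fatal(err)
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		log.Fatal(err)
	}
	if err := conn.Connect(); err != nil {
		log.Fatal(err)
	}
	defer conn.Shutdown()

	// Same JSON payload, but routed to the active mgr instead of a mon.
	// MgrCommand takes a slice of JSON-encoded command buffers.
	cmd := []byte(`{"prefix":"osd df","format":"json"}`)
	buf, info, err := conn.MgrCommand([][]byte{cmd})
	if err != nil {
		log.Fatalf("mgr command failed: %v (info: %s)", err, info)
	}
	fmt.Println(string(buf))
}

This bypasses the mon-to-mgr forwarding path that appears to be where the commands get stuck.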

Actions #4

Updated by Oleksandr Mykhalskyi 8 months ago

We see the same behavior in Pacific 16.2.12. It may be related to very slow processing of PG stats by the mgr daemon and an overloaded throttle-mgr_mon_messages queue; please see my comment in another ticket: https://tracker.ceph.com/issues/61925#note-6
