Bug #61925

closed

Mgr daemonserver takes 8 seconds to process pgstats

Added by qing zhao 11 months ago. Updated 9 months ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
14.2.22
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In theory, the DaemonServer processes messages serially and should not be blocked on the ClusterState lock, so this operation should be fast.

version : 14.2.22
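
For context, a rough sketch of the path I mean (my reading of the code, simplified and not verbatim Nautilus source): the messenger delivers each pg_stats message to DaemonServer on the single ms_dispatch thread, which hands it to ClusterState.

// Rough sketch, not the exact Nautilus code: messages arrive serially
// on the single ms_dispatch thread, and pg_stats is handed to ClusterState.
bool DaemonServer::handle_pgstats(MPGStats *stats)
{
  cluster_state.ingest_pgstats(stats);  // takes ClusterState::lock internally
  // ... remaining handling elided ...
  return true;
}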


Files

slowop-daemonserver.png (70.3 KB) — qing zhao, 07/07/2023 01:27 AM
Actions #1

Updated by qing zhao 11 months ago

void ClusterState::ingest_pgstats(MPGStats *stats) {
  std::lock_guard l(lock);  // ClusterState lock, held for the whole ingest

  const int from = stats->get_orig_source().num();

  // check whether the reporting OSD is still "in" the OSDMap
  bool is_in = with_osdmap([from](const OSDMap& osdmap) {
    return osdmap.is_in(from);
  });
  ...
}
The only possibility I can see is that the with_osdmap() call is what makes the processing so slow?
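
One way to check that would be to time the with_osdmap() call directly; a minimal instrumentation sketch (hypothetical, not existing code) could look like this:

// Hypothetical timing around with_osdmap() inside ingest_pgstats(),
// to see whether it accounts for the multi-second processing time.
auto t0 = ceph::coarse_mono_clock::now();
bool is_in = with_osdmap([from](const OSDMap& osdmap) {
  return osdmap.is_in(from);
});
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
    ceph::coarse_mono_clock::now() - t0).count();
dout(1) << "ingest_pgstats: with_osdmap took " << ms << " ms" << dendl;
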
Actions #2

Updated by Casey Bodley 10 months ago

  • Project changed from rgw to mgr
Actions #3

Updated by Casey Bodley 10 months ago

version : 14.2.22

please note that ceph version 14.x is no longer supported and won't receive bug fixes: https://docs.ceph.com/en/latest/releases/index.html#archived-releases

do you know if this still happens on recent versions?

Actions #4

Updated by qing zhao 10 months ago

Casey Bodley wrote:

version : 14.2.22

please note that ceph version 14.x is no longer supported and won't receive bug fixes: https://docs.ceph.com/en/latest/releases/index.html#archived-releases

do you know if this still happens on recent versions?

We haven't tested other versions yet.

Actions #5

Updated by Radoslaw Zarzynski 10 months ago

  • Status changed from New to Won't Fix

Closing as Nautilus is EOL. Feel free to reopen if it happens on a supported version.

Actions #6

Updated by Oleksandr Mykhalskyi 9 months ago

Dear developers,

We can observe something similar in our Pacific 16.2.12 cluster, with about 800 OSDs and 20000 PGs.

The mgr daemon takes about 8-9 seconds to handle each pg_stats message.

.............. the pg_stats message is waiting for some time because the throttle-mgr_mon_bytes queue is full ...........
2023-09-01T11:25:23.998+0000 7f4562307700 1 --2- [v2:xxxxxx:6800/876123119,v1:xxxxxx:6801/876123119] >> [v2:yyyyyy:6846/4170787396,v1:yyyyyy:6850/4170787396] conn(0x55cba072bc00 0x55cbd5fce300 secure :-1 s=THROTTLE_BYTES pgs=97907 cs=0 l=1 rev1=1 rx=0x55cbc275f620 tx=0x55cc55c28de0).throttle_bytes wants 140834 bytes from policy throttler 2147466527/2147483648 failed, just wait.
2023-09-01T11:25:24.006+0000 7f4562307700 1 --2- [v2:xxxxxx:6800/876123119,v1:xxxxxx:6801/876123119] >> [v2:yyyyyy:6846/4170787396,v1:yyyyyy:6850/4170787396] conn(0x55cba072bc00 0x55cbd5fce300 secure :-1 s=THROTTLE_BYTES pgs=97907 cs=0 l=1 rev1=1 rx=0x55cbc275f620 tx=0x55cc55c28de0).throttle_bytes wants 140834 bytes from policy throttler 2147466527/2147483648 failed, just wait.
2023-09-01T11:25:24.122+0000 7f4562307700 5 --2- [v2:xxxxxx:6800/876123119,v1:xxxxxx:6801/876123119] >> [v2:yyyyyy:6846/4170787396,v1:yyyyyy:6850/4170787396] conn(0x55cba072bc00 0x55cbd5fce300 secure :-1 s=THROTTLE_DONE pgs=97907 cs=0 l=1 rev1=1 rx=0x55cbc275f620 tx=0x55cc55c28de0).handle_message got 140834 + 0 + 0 byte message. envelope type=1794 src osd.217 off 0
2023-09-01T11:25:24.142+0000 7f4562307700 5 --2- [v2:xxxxxx:6800/876123119,v1:xxxxxx:6801/876123119] >> [v2:yyyyyy:6846/4170787396,v1:yyyyyy:6850/4170787396] conn(0x55cba072bc00 0x55cbd5fce300 secure :-1 s=READ_MESSAGE_COMPLETE pgs=97907 cs=0 l=1 rev1=1 rx=0x55cbc275f620 tx=0x55cc55c28de0).handle_message received message m=0x55cbd60e0a80 seq=38 from=osd.217 type=1794 mgrreport(osd.217 +0-0 packed 1350 daemon_metrics=2) v9
2023-09-01T11:25:24.142+0000 7f4562307700 5 --2- [v2:xxxxxx:6800/876123119,v1:xxxxxx:6801/876123119] >> [v2:yyyyyy:6846/4170787396,v1:yyyyyy:6850/4170787396] conn(0x55cba072bc00 0x55cbd5fce300 secure :-1 s=THROTTLE_DONE pgs=97907 cs=0 l=1 rev1=1 rx=0x55cbc275f620 tx=0x55cc55c28de0).handle_message got 60629 + 0 + 0 byte message. envelope type=87 src osd.217 off 0
2023-09-01T11:25:24.142+0000 7f4562307700 5 --2- [v2:xxxxxx:6800/876123119,v1:xxxxxx:6801/876123119] >> [v2:yyyyyy:6846/4170787396,v1:yyyyyy:6850/4170787396] conn(0x55cba072bc00 0x55cbd5fce300 secure :-1 s=READ_MESSAGE_COMPLETE pgs=97907 cs=0 l=1 rev1=1 rx=0x55cbc275f620 tx=0x55cc55c28de0).handle_message received message m=0x55cbbc995200 seq=39 from=osd.217 type=87 pg_stats(75 pgs seq 1498076003062745 v 0) v2
2023-09-01T11:25:28.626+0000 7f443f4b9700 1 -- [v2:xxxxxx:6800/876123119,v1:xxxxxx:6801/876123119] <== osd.217 v2:yyyyyy:6846/4170787396 38 ==== mgrreport(osd.217 +0-0 packed 1350 daemon_metrics=2) v9 ==== 140834+0+0 (secure 0 0 0) 0x55cbd60e0a80 con 0x55cba072bc00
2023-09-01T11:25:33.354+0000 7f443f4b9700 1 -- [v2:xxxxxx:6800/876123119,v1:xxxxxx:6801/876123119] <== osd.217 v2:yyyyyy:6846/4170787396 39 ==== pg_stats(75 pgs seq 1498076003062745 v 0) v2 ==== 60629+0+0 (secure 0 0 0) 0x55cbbc995200 con 0x55cba072bc00

All CLI commands that go through the mgr (ceph osd df, ceph osd pool stats, ceph orch commands) were executing very slowly and could take 1 minute or more, because the throttle-mgr_mon_messages queue was always busy.
This queue is always full due to the long processing of pg_stats.
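
For what it's worth, my reading of those "failed, just wait" lines (a conceptual sketch, not the actual messenger code; names are placeholders) is that the policy throttler has to admit a message before it can be read, and its slot is only released after the message has been dispatched, so slow pg_stats handling backs up everything behind it:

// Conceptual sketch (not the actual AsyncMessenger code) of how the
// policy throttler appears to gate reads; names are placeholders.
if (!policy_throttler->get_or_fail(message_size)) {
  // throttle full: stay in THROTTLE_BYTES and retry later
  // ("wants ... bytes from policy throttler ... failed, just wait")
  return;
}
// read + dispatch the message; the reserved bytes/message slot seem to
// be put() back only after dispatch is done, so an 8-second pg_stats
// dispatch keeps the throttle full and stalls the mon/CLI messages
// queued behind it on the same connection.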

I tried to increase the size of the mgr_mon queues and also had to increase mgr_stats_period:
ceph config set mgr mgr_mon_messages 2048
ceph config set mgr mgr_mon_bytes 2147483648
ceph config set mgr mgr_stats_period 60

CLI commands like ceph osd df now execute more quickly (up to 10 seconds), but the cluster/pool I/O statistics are completely inadequate because of the large mgr_stats_period.

The ms_dispatch thread of the mgr daemon is constantly busy at 100%.
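
A rough back-of-envelope calculation (assuming the default mgr_stats_period of 5 seconds and that each OSD sends one pg_stats per period) shows why a single dispatch thread cannot keep up:

800 OSDs / 5 s period   ≈ 160 pg_stats messages per second arriving
1 ms_dispatch thread    → budget of about 1/160 s ≈ 6 ms per message
800 OSDs / 60 s period  ≈ 13 messages per second → budget of about 75 ms

So even a few tens of milliseconds of real processing per message saturates the thread; raising mgr_stats_period only lowers the arrival rate, which matches the observation that CLI commands got faster while the I/O statistics became stale.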

docker exec ceph-360d9414-747a-11ec-9185-d9f3fb3bb55d-mgr-hostname-xulvzb ceph daemon mgr.hostname.xulvzb perf dump

"throttle-mgr_mon_bytes": {
"val": 2146217635,
"max": 2147483648,
"get_started": 0,
"get": 37690095,
"get_sum": 57723874246468,
"get_or_fail_fail": 536371492467,
"get_or_fail_success": 37690095,
"take": 0,
"take_sum": 0,
"put": 37689739,
"put_sum": 57721728028833,
"wait": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
}
},
"throttle-mgr_mon_messsages": {
"val": 2013,
"max": 2048,
"get_started": 0,
"get": 37697121,
"get_sum": 37697121,
"get_or_fail_fail": 862229834,
"get_or_fail_success": 37697121,
"take": 0,
"take_sum": 0,
"put": 37695108,
"put_sum": 37695108,
"wait": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
}

The same behavior was also reported by other people in https://www.spinics.net/lists/ceph-users/msg73760.html

Thank you.
