Bug #43364

ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive

Added by Eric Petit over 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: ceph-mgr
Target version: -
% Done: 0%
Source:
Tags:
Backport: luminous, mimic, nautilus
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After upgrading from Luminous to Nautilus, I noticed that ceph-mgr would become partly unresponsive on larger clusters. Htop revealed that a 'mgr-fin' thread was permanently using 100% CPU on the active manager, and a `perf dump` on the admin socket showed this:

    "finisher-Mgr": {
        "queue_len": 1225609,
        "complete_latency": {
            "avgcount": 645,
            "sum": 0.630415978,
            "avgtime": 0.000977389
        }
    },
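
For anyone who wants to confirm the same symptom, the key indicator is queue_len climbing between digests. Below is a minimal sketch of how to watch it; the mgr name is a placeholder and it simply shells out to the same admin-socket `perf dump` shown above:

    import json
    import subprocess
    import time

    MGR_NAME = "mgr.x"  # placeholder: substitute the active mgr's name

    def finisher_queue_len():
        # `ceph daemon <name> perf dump` returns the counters quoted above.
        out = subprocess.check_output(["ceph", "daemon", MGR_NAME, "perf", "dump"])
        return json.loads(out)["finisher-Mgr"]["queue_len"]

    while True:
        print("finisher-Mgr queue_len = %d" % finisher_queue_len())
        time.sleep(5)

If the printed value keeps rising, the finisher is falling behind.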

The origin of the problem appears to be the digest messages from the monitors: every time one comes in, ceph-mgr triggers notify_all for mon_status/health/pg_summary towards all Python modules, and that processing is serialized onto the finisher thread. When notifications come in faster than Python can process them (likely because pg_summary is rather big), the queue can grow indefinitely, modules work with outdated information, and any command sent to them (commands are also piped through that thread) hangs.
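
To make the failure mode concrete: this is an unthrottled producer/consumer setup, where each digest enqueues notifications for every module onto a single finisher thread and nothing slows the producer down when the consumer falls behind. A toy model (not Ceph code, with made-up timings chosen so that processing is slower than the digest period):

    import queue
    import threading
    import time

    DIGEST_PERIOD = 0.05   # stands in for mon_mgr_digest_period
    NOTIFY_TIME = 0.07     # stands in for the Python modules' processing time
    NUM_DIGESTS = 40

    finisher = queue.Queue()

    def worker():
        # A single thread drains the queue, strictly one notification at a time.
        while True:
            finisher.get()
            time.sleep(NOTIFY_TIME)  # simulate notify_all() work in the modules

    threading.Thread(target=worker, daemon=True).start()

    for _ in range(NUM_DIGESTS):
        finisher.put(("notify_all", "pg_summary"))  # a new digest arrives
        time.sleep(DIGEST_PERIOD)

    print("queue_len after %d digests: %d" % (NUM_DIGESTS, finisher.qsize()))

Because NOTIFY_TIME is larger than DIGEST_PERIOD, the printed queue length grows roughly linearly with the number of digests, which is the same shape as the queue_len in the perf dump above.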

Increasing mon_mgr_digest_period (from the default 5 seconds) helps mitigate the problem. On the largest cluster, setting it to 15 stopped the queue from growing; looking at htop again, I now see the finisher thread spike to 100% CPU for only ~7 seconds every 15 seconds. However, some events can still overwhelm it (e.g. changes in health checks trigger additional digests, which can happen in a number of cases: the upmap balancer running, rolling restarts for upgrades, a flapping OSD, etc.).
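
Put differently, the mitigation is a rate comparison: the queue stays bounded only if one digest's worth of notification processing fits inside one digest period. A back-of-the-envelope check using the ~7 seconds observed on this cluster (the figure is specific to this setup):

    WORK_PER_DIGEST = 7.0  # seconds of notification processing per digest (observed here)

    for digest_period in (5.0, 15.0):
        surplus = digest_period - WORK_PER_DIGEST
        if surplus < 0:
            print("period %4.1fs: backlog grows by ~%.1fs of work per digest"
                  % (digest_period, -surplus))
        else:
            print("period %4.1fs: keeps up, finisher idle ~%.0f%% of the time"
                  % (digest_period, surplus / digest_period * 100))

With the default 5-second period the backlog grows by about 2 seconds of work per digest; at 15 seconds it keeps up, matching the ~7-seconds-busy-out-of-15 pattern seen in htop.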

I am running Nautilus 14.2.5 on Debian 9 (using Ubuntu Xenial packages); the largest cluster has 1440 OSDs with 8192 erasure-coded (10+2) PGs.

Possibly related to:
https://tracker.ceph.com/issues/43317
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GG5J4ERTCQRA2QSA6NN6VX3YLUIVEJ3W/#PKI3O2YZE2IJF3KJBBAL3YKYRYWJ4D5O

mgr-callgraph.txt (131 KB) Eric Petit, 12/20/2019 06:51 AM

mgr.gdbpmp (233 KB) Eric Petit, 12/20/2019 06:51 AM

mgr-14.2.5-1-g4ba16f6-1xenial.gdbpmp (517 KB) Eric Petit, 12/23/2019 09:22 AM

mgr-14.2.5-1-g4ba16f6-1xenial.callgraph.txt (118 KB) Eric Petit, 12/23/2019 09:22 AM


Related issues

Related to mgr - Bug #39264: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors Resolved
Duplicated by mgr - Bug #43317: high CPU usage, ceph-mgr very slow or unresponsive following upgrade from Nautilus v14.2.4 to v14.2.5 Duplicate
Copied to mgr - Backport #43466: nautilus: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive Resolved
Copied to mgr - Backport #43467: mimic: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive Rejected
Copied to mgr - Backport #43468: luminous: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive Rejected

History

#1 Updated by Bryan Stillwell over 4 years ago

I'm also seeing the 'mgr-fin' thread using 100% of a CPU core. However, I don't see a high queue_len, so maybe ~350 OSDs is the point where the mgr just barely keeps up.

#2 Updated by Neha Ojha over 4 years ago

  • Priority changed from Normal to Urgent

#3 Updated by Neha Ojha over 4 years ago

  • Status changed from New to Need More Info

We added network ping time monitoring in 14.2.5 - https://github.com/ceph/ceph/pull/30195.
We'd like to verify whether this PR is the root cause of the problem: the feature added extra heartbeat information to the osd_stat, which may contribute to the extra processing the mgr has to do.

1. Run "ceph daemon mgr.? dump_osd_network 0" - should include network information like https://github.com/ceph/ceph/pull/28755#issuecomment-517071501
2. Set osd_mon_heartbeat_stat_stale to 1, something like "ceph config set osd osd_mon_heartbeat_stat_stale 1"
3. Run "ceph daemon mgr.? dump_osd_network 0" to verify that the network information is gone and check if the cpu usage goes down

Doing this essentially lowers the threshold for marking heartbeat information stale, which in turn removes the heartbeat information from the osd_stat.
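
Steps 1-3 could be scripted along these lines; this is only a sketch - the mgr name is a placeholder and the "entries" key is assumed from the example output linked above:

    import json
    import subprocess

    MGR_NAME = "mgr.x"  # placeholder: substitute the active mgr's name

    def osd_network_entries():
        out = subprocess.check_output(
            ["ceph", "daemon", MGR_NAME, "dump_osd_network", "0"])
        # An empty "entries" list means the heartbeat data has been aged out.
        return len(json.loads(out).get("entries", []))

    print("entries before:", osd_network_entries())
    subprocess.check_call(
        ["ceph", "config", "set", "osd", "osd_mon_heartbeat_stat_stale", "1"])
    # wait a few heartbeat intervals before re-checking
    print("entries after:", osd_network_entries())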

#4 Updated by Paul Emmerich over 4 years ago

perf top:

   6.13%  libstdc++.so.6.0.25                [.] __dynamic_cast
   5.75%  libstdc++.so.6.0.25                [.] std::__ostream_insert<char, std::char_traits<char> >
   2.75%  libstdc++.so.6.0.25                [.] std::locale::locale
   2.73%  libstdc++.so.6.0.25                [.] std::basic_streambuf<char, std::char_traits<char> >::xsputn
   2.32%  libstdc++.so.6.0.25                [.] std::locale::~locale
   2.07%  libstdc++.so.6.0.25                [.] std::locale::operator=
   2.04%  libpython2.7.so.1.0                [.] PyDict_Next
   1.86%  libstdc++.so.6.0.25                [.] std::ostream::sentry::sentry
   1.84%  libpython2.7.so.1.0                [.] 0x00000000000d39a0
   1.78%  libc-2.28.so                       [.] vfprintf
   1.77%  libstdc++.so.6.0.25                [.] __cxxabiv1::__vmi_class_type_info::__do_dyncast
   1.72%  libpython2.7.so.1.0                [.] 0x00000000000d3910
   1.71%  libstdc++.so.6.0.25                [.] __cxxabiv1::__si_class_type_info::__do_dyncast
   1.47%  libpython2.7.so.1.0                [.] PyObject_Malloc
   1.46%  libstdc++.so.6.0.25                [.] std::ostream::sentry::~sentry
   1.36%  ceph-mgr                           [.] std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace
   1.16%  libceph-common.so.0                [.] ceph::fixed_u_to_string[abi:cxx11]
   1.01%  libstdc++.so.6.0.25                [.] std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<long>
   0.90%  libc-2.28.so                       [.] memcpy

the dump_osd_network command results in 33 megabytes of JSON

setting `osd_mon_heartbeat_stat_stale` to 1 causes it to go down by ~30% after a few minutes, but the CPU load is unchanged.

#5 Updated by Eric Petit over 4 years ago

Attaching a gdbpmp profile; osd_stat_t::dump appears to be the hotspot.

I tried the heartbeat change:

# Before
% time ceph daemon mgr.xxx dump_osd_network 0 | wc
7346067 15330919 195250781
real    0m7.886s
# After setting osd_mon_heartbeat_stat_stale=1 and restarting ceph-mgr
% time ceph daemon mgr.xxx dump_osd_network 0 | wc
      4       6      42
real    0m0.218s

Unfortunately it did not reduce the CPU usage - I still see it spiking to 100% for about 7 seconds every mon_mgr_digest_period.

#6 Updated by Neha Ojha over 4 years ago

Unfortunately, setting osd_mon_heartbeat_stat_stale=1 does not completely disable the network ping time feature. The way the code is written, it removes one entry at a time on every heartbeat interval, so there could still be new entries being added while we are marking some stale.
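
As a toy illustration of why that only partially disables the feature (made-up numbers, purely to show the shape of the behaviour): if one stale entry is removed per heartbeat interval while fresh ping records keep arriving, the data drains slowly at best and can stall entirely:

    entries = 100  # existing heartbeat records (made-up)
    for interval in range(1, 21):
        if interval % 3 == 0:   # assumption: a fresh record shows up now and then
            entries += 1
        entries = max(0, entries - 1)  # "removes one entry at a time"
    print("entries left after 20 intervals:", entries)

This is consistent with the dump shrinking somewhat (as Paul observed) without the feature being fully switched off.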

Looking at the wallclock profiler details we've received, dump_osd_stats and fixed_u_to_string consuming a lot of CPU still points to the network ping time feature: fixed_u_to_string was only introduced by this feature, and the only new data we added to osd_stat_t came from the same feature.

We are currently building a patch based on 14.2.5 that disables sending the network-ping-time stats to the mgr in dump_osd_stats. Once it builds, it would be great if someone could give it a try.

https://github.com/ceph/ceph/pull/32386
Repo: https://shaman.ceph.com/repos/ceph/wip-nautilus-ceph-mgr-perf-test/4ba16f61da809848631e157344676d8bf8eaba95/

#7 Updated by Eric Petit over 4 years ago

Thank you for the patch,

I have tried the test build, but I'm afraid I did not see a reduction in CPU usage, and the profiler data (attached) remained similar. Looking at the PR and the callgraph, the processing does not seem to be triggered by hitting dump_osd_stats directly; it rather goes PGMap::dump -> PGMap::dump_osd_stats, and wouldn't the network information still be sent in that case?

#9 Updated by Eric Petit over 4 years ago

Yep, that did it - the CPU spikes are much shorter (less than 1 sec) with the last patch, the processing queue isn't growing anymore, and mgr modules in general (dashboard, prometheus, ...) are much more responsive.

#10 Updated by David Zafman over 4 years ago

  • Status changed from Need More Info to Pending Backport
  • Backport set to luminous, mimic, nautilus

#11 Updated by Nathan Cutler about 4 years ago

David, are you going to do the backport? If not, can you briefly write here how to proceed?

#13 Updated by Nathan Cutler about 4 years ago

  • Copied to Backport #43466: nautilus: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive added

#14 Updated by Nathan Cutler about 4 years ago

  • Copied to Backport #43467: mimic: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive added

#15 Updated by Nathan Cutler about 4 years ago

  • Copied to Backport #43468: luminous: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive added

#16 Updated by Nathan Cutler about 4 years ago

  • Pull request ID set to 32406

#17 Updated by Rafal Wadolowski about 4 years ago

I saw mgr (v12.2.4) crashes on our big cluster (4200+ OSDs). It looks related to this issue. Our perf dump output:

"finisher-Mgr": {
        "queue_len": 0,
        "complete_latency": {
            "avgcount": 13491,
            "sum": 49.715575396,
            "avgtime": 0.003685091
        }
    },

It looks horrible.

#18 Updated by Nathan Cutler almost 4 years ago

  • Duplicated by Bug #43317: high CPU usage, ceph-mgr very slow or unresponsive following upgrade from Nautilus v14.2.4 to v14.2.5 added

#19 Updated by Lenz Grimmer almost 4 years ago

  • Related to Bug #39264: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors added

#20 Updated by David Zafman over 3 years ago

  • Status changed from Pending Backport to Resolved
