Bug #43364

ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive

Added by Eric Petit about 1 month ago. Updated 11 days ago.

Status:
Pending Backport
Priority:
Urgent
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
luminous, mimic, nautilus
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

After upgrading from Luminous to Nautilus, I noticed that ceph-mgr would become partly unresponsive on larger clusters. Htop revealed that a 'mgr-fin' thread was permanently using 100% CPU on the active manager, and a `perf dump` on the admin socket showed this:

    "finisher-Mgr": {
        "queue_len": 1225609,
        "complete_latency": {
            "avgcount": 645,
            "sum": 0.630415978,
            "avgtime": 0.000977389
        }
    },
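
For reference, those counters come from the mgr admin socket; something like the following reproduces the snippet above (mgr.xxx is a placeholder for the active mgr's name, and piping through jq is just for readability):

    ceph daemon mgr.xxx perf dump | jq '."finisher-Mgr"'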

The origin of the problem appears to be digest messages from the monitors: every time one comes in, ceph-mgr triggers notify_all for mon_status/health/pg_summary towards all Python modules, and that processing gets serialized into the single finisher thread. When notifications come in faster than Python can process them (likely because pg_summary is rather big), the queue grows indefinitely, modules work with outdated information, and any command towards them (commands also get piped through that thread) hangs.
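
A quick way to tell a queue that is merely busy from one that is growing without bound (and will therefore never drain) is to sample the counter repeatedly, along these lines (again, mgr.xxx and the jq filter are placeholders):

    # a steadily increasing queue_len means the finisher cannot keep up
    while sleep 5; do
        ceph daemon mgr.xxx perf dump | jq '."finisher-Mgr".queue_len'
    done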

Increasing mon_mgr_digest_period (from the default 5 seconds) helps mitigate the problem. On the largest cluster, setting it to 15 stopped the queue from growing; looking at htop again, I see the finisher thread spike to 100% CPU for only ~7 seconds every 15 seconds. However, some events can still overwhelm it: changes in health checks trigger additional digests, which happens in a number of cases (upmap balancer running, rolling restarts for upgrades, flapping OSDs, etc.).
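
The numbers above also give a rough back-of-envelope: each digest appears to cost the finisher somewhere around 7 seconds of Python processing, so with the default 5-second period notifications arrive faster than they can be drained and queue_len grows without bound, whereas a 15-second period leaves ~8 seconds of slack per cycle. For what it's worth, on Nautilus the period can be raised at runtime with something like:

    ceph config set mon mon_mgr_digest_period 15
    ceph config dump | grep mon_mgr_digest_period   # confirm the override took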

I am running Nautilus 14.2.5 on Debian 9 (using Ubuntu Xenial packages), with the largest cluster having 1440 OSDs and 8192 erasure-coded 10+2 PGs.

Possibly related to
https://tracker.ceph.com/issues/43317
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GG5J4ERTCQRA2QSA6NN6VX3YLUIVEJ3W/#PKI3O2YZE2IJF3KJBBAL3YKYRYWJ4D5O

mgr-callgraph.txt View (131 KB) Eric Petit, 12/20/2019 06:51 AM

mgr.gdbpmp (233 KB) Eric Petit, 12/20/2019 06:51 AM

mgr-14.2.5-1-g4ba16f6-1xenial.gdbpmp (517 KB) Eric Petit, 12/23/2019 09:22 AM

mgr-14.2.5-1-g4ba16f6-1xenial.callgraph.txt View (118 KB) Eric Petit, 12/23/2019 09:22 AM


Related issues

Copied to mgr - Backport #43466: nautilus: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive Resolved
Copied to mgr - Backport #43467: mimic: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive New
Copied to mgr - Backport #43468: luminous: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive New

History

#1 Updated by Bryan Stillwell about 1 month ago

I'm also seeing the 'mgr-fin' thread using 100% of a CPU core. However, I don't see a high queue_len, so maybe ~350 OSDs is the point where the mgr just barely keeps up.

#2 Updated by Neha Ojha about 1 month ago

  • Priority changed from Normal to Urgent

#3 Updated by Neha Ojha about 1 month ago

  • Status changed from New to Need More Info

We added network ping time monitoring in 14.2.5 - https://github.com/ceph/ceph/pull/30195.
We'd like to verify whether this PR is the root cause of the problem, because the feature added extra heartbeat information to osd_stat, which might contribute to the extra processing the mgr has to do.

1. Run "ceph daemon mgr.? dump_osd_network 0" - the output should include network information like https://github.com/ceph/ceph/pull/28755#issuecomment-517071501
2. Set osd_mon_heartbeat_stat_stale to 1, something like "ceph config set osd osd_mon_heartbeat_stat_stale 1"
3. Run "ceph daemon mgr.? dump_osd_network 0" again to verify that the network information is gone, and check whether the CPU usage goes down

Doing this essentially lowers the threshold for marking heartbeat information stale, which in turn removes the heartbeat information from osd_stat.
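
Spelled out as commands (substitute the active mgr's name for mgr.?), the check looks something like:

    # 1. confirm that network ping information is currently present in the dump
    ceph daemon mgr.? dump_osd_network 0
    # 2. lower the staleness threshold to 1 second
    ceph config set osd osd_mon_heartbeat_stat_stale 1
    # 3. re-run the dump: the network information should disappear; watch whether mgr CPU drops
    ceph daemon mgr.? dump_osd_network 0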

#4 Updated by Paul Emmerich 30 days ago

perf top:

   6.13%  libstdc++.so.6.0.25                [.] __dynamic_cast
   5.75%  libstdc++.so.6.0.25                [.] std::__ostream_insert<char, std::char_traits<char> >
   2.75%  libstdc++.so.6.0.25                [.] std::locale::locale
   2.73%  libstdc++.so.6.0.25                [.] std::basic_streambuf<char, std::char_traits<char> >::xsputn
   2.32%  libstdc++.so.6.0.25                [.] std::locale::~locale
   2.07%  libstdc++.so.6.0.25                [.] std::locale::operator=
   2.04%  libpython2.7.so.1.0                [.] PyDict_Next
   1.86%  libstdc++.so.6.0.25                [.] std::ostream::sentry::sentry
   1.84%  libpython2.7.so.1.0                [.] 0x00000000000d39a0
   1.78%  libc-2.28.so                       [.] vfprintf
   1.77%  libstdc++.so.6.0.25                [.] __cxxabiv1::__vmi_class_type_info::__do_dyncast
   1.72%  libpython2.7.so.1.0                [.] 0x00000000000d3910
   1.71%  libstdc++.so.6.0.25                [.] __cxxabiv1::__si_class_type_info::__do_dyncast
   1.47%  libpython2.7.so.1.0                [.] PyObject_Malloc
   1.46%  libstdc++.so.6.0.25                [.] std::ostream::sentry::~sentry
   1.36%  ceph-mgr                           [.] std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace
   1.16%  libceph-common.so.0                [.] ceph::fixed_u_to_string[abi:cxx11]
   1.01%  libstdc++.so.6.0.25                [.] std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<long>
   0.90%  libc-2.28.so                       [.] memcpy

the dump_osd_network command results in 33 megabytes of JSON

setting `osd_mon_heartbeat_stat_stale` to 1 causes it to go down by ~30% after a few minutes, but the CPU load is unchanged.

#5 Updated by Eric Petit 30 days ago

Attaching a gdbpmp profile; osd_stat_t::dump appears to be the hotspot.

I tried the heartbeat change:

    # Before
    % time ceph daemon mgr.xxx dump_osd_network 0 | wc
    7346067 15330919 195250781
    real    0m7.886s
    # After setting osd_mon_heartbeat_stat_stale=1 and restarting ceph-mgr
    % time ceph daemon mgr.xxx dump_osd_network 0 | wc
          4       6      42
    real    0m0.218s

Unfortunately it did not reduce the CPU usage - I still see it spiking to 100% for about 7 seconds every mon_mgr_digest_period.

#6 Updated by Neha Ojha 29 days ago

Unfortunately, setting osd_mon_heartbeat_stat_stale=1 does not completely disable the network ping time feature. The way the code is written, it removes one entry at a time on every heartbeat interval, so there could still be new entries being added while we are marking some stale.

Looking at the wallclock profiler details we've received, dump_osd_stats and fixed_u_to_string consuming a lot of CPU still points to the network ping time feature: fixed_u_to_string was only introduced with this feature, and the only new fields we added to osd_stat_t were due to the same feature.

We are currently building a patch based on 14.2.5 which disables sending the network ping time stats to the mgr in dump_osd_stats. Once it builds, it'd be great if someone could give it a try.

https://github.com/ceph/ceph/pull/32386
Repo: https://shaman.ceph.com/repos/ceph/wip-nautilus-ceph-mgr-perf-test/4ba16f61da809848631e157344676d8bf8eaba95/

#7 Updated by Eric Petit 27 days ago

Thank you for the patch.

I have tried the test build, but I'm afraid I did not see a reduction in CPU usage, and the profiler data (attached) remained similar. Looking at the PR and the callgraph, the processing does not seem to be triggered by hitting dump_osd_stats directly; rather it goes through PGMap::dump -> PGMap::dump_osd_stats, in which case the network information would still be sent?

#9 Updated by Eric Petit 26 days ago

Yep, that did it - with the last patch the CPU spikes are much shorter (less than 1 sec), the processing queue isn't growing anymore, and mgr modules in general (dashboard, prometheus...) are much more responsive.

#10 Updated by David Zafman 22 days ago

  • Status changed from Need More Info to Pending Backport
  • Backport set to luminous, mimic, nautilus

#11 Updated by Nathan Cutler 16 days ago

David, are you going to do the backport? If not, can you briefly write here how to proceed?

#13 Updated by Nathan Cutler 12 days ago

  • Copied to Backport #43466: nautilus: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive added

#14 Updated by Nathan Cutler 12 days ago

  • Copied to Backport #43467: mimic: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive added

#15 Updated by Nathan Cutler 12 days ago

  • Copied to Backport #43468: luminous: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive added

#16 Updated by Nathan Cutler 12 days ago

  • Pull request ID set to 32406

#17 Updated by Rafal Wadolowski 11 days ago

I have seen mgr (v12.2.4) crashes on our big cluster (4200+ OSDs). It looks related to this issue. Our perf dump output:

"finisher-Mgr": {
        "queue_len": 0,
        "complete_latency": {
            "avgcount": 13491,
            "sum": 49.715575396,
            "avgtime": 0.003685091
        }
    },

It looks horrible.
