Bug #43317
high CPU usage, ceph-mgr very slow or unresponsive following upgrade from Nautilus v14.2.4 to v14.2.5
Description
Since upgrading to Nautilus v14.2.5, ceph-mgr has been essentially unusable. top shows ceph-mgr consistently using between 1 and 4 cores, even without any optional modules loaded:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1289 ceph      20   0 2565980   1.5g  18156 S 369.1 40.0 126:26.77 ceph-mgr

[root@ceph ~]# ceph mgr module ls | head
{
    "enabled_modules": []
...
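For what it's worth, standard profiling tools can show where those cycles are going while the mgr spins (PID taken from the top output above; these are generic diagnostics, not specific to this bug):

[root@ceph ~]# perf top -p 1289
[root@ceph ~]# gdb -p 1289 -batch -ex 'thread apply all bt'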
Simple commands appear to hang, complete at random intervals, or fail outright:
[root@ceph ~]# time ceph balancer status
{
    "last_optimize_duration": "",
    "plans": [],
    "mode": "none",
    "active": false,
    "optimize_result": "",
    "last_optimize_started": ""
}

real    0m32.232s
user    0m0.284s
sys     0m0.030s

[root@ceph ~]# time ceph crash ls-new
ID                                                                ENTITY NEW
2019-07-24_00:55:04.694908Z_bc80ad27-e293-45c3-9bdb-5fd4982b8edc  osd.0  *
2019-07-24_00:55:06.003526Z_f0027b87-1654-491b-9b2d-a168322d44ee  osd.1  *
2019-07-24_00:55:06.735785Z_2075c453-fbd2-4f2c-8211-dafafc73c6b8  osd.0  *
2019-07-24_00:55:07.460188Z_a02dec84-be70-4665-b2c5-323e067af7b9  osd.1  *
2019-07-24_00:55:08.206238Z_477ce6da-1d59-47da-af95-e6126183a9a3  osd.0  *
2019-07-24_00:58:06.215514Z_2bebd851-fb7b-423d-b326-e973d425c13b  osd.1  *
2019-07-24_00:58:08.246643Z_70511d52-c029-4211-80b2-01e312bef787  osd.1  *
2019-07-24_00:58:09.734660Z_01cc78e2-235c-41b7-9b69-5e04b3804146  osd.1  *
2019-07-24_01:01:41.166534Z_f15d6aae-81ec-4bc1-9923-b05735b69d78  osd.0  *
2019-07-24_01:01:43.180687Z_3fd3ae07-65ed-4794-9fe7-1f9fb388dd2c  osd.0  *
2019-07-24_01:01:44.791007Z_c0ab06f0-57da-4ed8-97ac-83f20dd40d06  osd.0  *
2019-07-24_01:03:24.116695Z_c9a0e53d-282f-45f4-9ef8-e0fa516769fb  osd.0  *
2019-07-24_01:03:26.272656Z_3234791d-10d3-4c4e-91f9-8d0a22edc9e9  osd.0  *
2019-07-24_01:03:27.739259Z_2fb70aa5-6cdd-4d84-a5f6-bfc02b58730f  osd.0  *
2019-07-24_01:06:49.873704Z_61c6dd3f-4d9f-4d48-9beb-7669ebf55b46  osd.0  *
2019-07-24_01:06:51.986488Z_3479bfac-d80b-4659-b725-76ce264016ec  osd.0  *
2019-07-24_01:06:53.462454Z_7115d298-a236-4f29-b086-2f181230bbfc  osd.0  *
2019-07-24_01:14:53.936563Z_cd9adbd4-ca35-4b86-9c31-82129dbc7107  osd.0  *
2019-07-24_01:14:55.990911Z_54ac3f15-8f1a-4b05-b45a-5b57b64e2770  osd.0  *
2019-07-24_01:14:57.478702Z_aaa89ad0-ff89-4a89-b7f1-a37cf142fa7d  osd.0  *
2019-07-24_01:18:03.349167Z_0f48a4b4-6528-444a-b19e-de624db20adb  osd.1  *
2019-07-24_01:18:05.089556Z_cadb27a7-c609-43f2-8a9c-b32a751dabcb  osd.1  *
2019-07-24_01:18:06.531971Z_a3d7f660-0680-4785-bc48-e35a9ca6c8d1  osd.1  *
2019-07-24_01:19:03.301233Z_e9b70074-ac14-42d0-90a6-39fade50a18e  osd.1  *
2019-07-24_01:19:04.990424Z_190b63ae-b00f-4cec-b737-ad823a6d10c8  osd.1  *
2019-07-24_01:19:06.233479Z_4b705469-630f-4bba-805f-322274f2038d  osd.1  *

real    1m18.191s
user    0m0.295s
sys     0m0.042s

[root@ceph ~]# time ceph osd status | head
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 914, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/status/module.py", line 253, in handle_command
    return self.handle_osd_status(cmd)
  File "/usr/share/ceph/mgr/status/module.py", line 237, in handle_osd_status
    mgr_util.format_dimless(self.get_rate("osd", osd_id.__str__(), "osd.op_w") +
  File "/usr/share/ceph/mgr/status/module.py", line 47, in get_rate
    return (data[-1][1] - data[-2][1]) / float(data[-1][0] - data[-2][0])
ZeroDivisionError: float division by zero

real    3m8.023s
user    0m0.348s
sys     0m0.068s
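The ZeroDivisionError comes from the status module dividing the delta of the two most recent perf-counter samples by the delta of their timestamps, which is zero when both samples carry the same timestamp. A minimal sketch of a guard -- assuming each sample is a (timestamp, value) pair as the traceback suggests, and not claiming this is the actual upstream fix -- would bail out early:

def get_rate(self, daemon_type, daemon_name, stat):
    # Perf counter history: a list of (timestamp, value) samples.
    data = self.get_counter(daemon_type, daemon_name, stat)[stat]
    # Guard against too few samples, or two samples sharing a timestamp,
    # either of which makes the division below blow up.
    if len(data) < 2 or data[-1][0] == data[-2][0]:
        return 0.0
    return (data[-1][1] - data[-2][1]) / float(data[-1][0] - data[-2][0])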
One workaround that gets ceph-mgr to respond appears to be enabling or disabling a module -- any module. For example, the "ceph osd status" command that previously timed out succeeded when the iostat module was enabled while the command was pending:
[root@ceph ~]# time ceph osd status | head
+-----+------+-------+-------+--------+---------+--------+---------+-----------+
| id  | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+-----+------+-------+-------+--------+---------+--------+---------+-----------+
| 0   |      |    0  |    0  |    0   |    0    |    0   |    0    | exists,up |
| 1   |      |    0  |    0  |    0   |    0    |    0   |    0    | exists,up |
| 2   |      |    0  |    0  |    0   |    0    |    0   |    0    | exists,up |
| 3   |      |    0  |    0  |    0   |    0    |    0   |    0    | exists,up |
| 4   |      |    0  |    0  |    0   |    0    |    0   |    0    | exists,up |
| 5   |      |    0  |    0  |    0   |    0    |    0   |    0    | exists,up |
| 6   |      |    0  |    0  |    0   |    0    |    0   |    0    | exists,up |

real    0m11.827s
user    0m0.280s
sys     0m0.035s
The "ceph crash ls-new" command which previously took over a minute returned more or less promptly as the iostat module was disabled:
[root@ceph ~]# time ceph crash ls-new
ID                                                                ENTITY NEW
2019-07-24_00:55:04.694908Z_bc80ad27-e293-45c3-9bdb-5fd4982b8edc  osd.0  *
2019-07-24_00:55:06.003526Z_f0027b87-1654-491b-9b2d-a168322d44ee  osd.1  *
2019-07-24_00:55:06.735785Z_2075c453-fbd2-4f2c-8211-dafafc73c6b8  osd.0  *
2019-07-24_00:55:07.460188Z_a02dec84-be70-4665-b2c5-323e067af7b9  osd.1  *
2019-07-24_00:55:08.206238Z_477ce6da-1d59-47da-af95-e6126183a9a3  osd.0  *
2019-07-24_00:58:06.215514Z_2bebd851-fb7b-423d-b326-e973d425c13b  osd.1  *
2019-07-24_00:58:08.246643Z_70511d52-c029-4211-80b2-01e312bef787  osd.1  *
2019-07-24_00:58:09.734660Z_01cc78e2-235c-41b7-9b69-5e04b3804146  osd.1  *
2019-07-24_01:01:41.166534Z_f15d6aae-81ec-4bc1-9923-b05735b69d78  osd.0  *
2019-07-24_01:01:43.180687Z_3fd3ae07-65ed-4794-9fe7-1f9fb388dd2c  osd.0  *
2019-07-24_01:01:44.791007Z_c0ab06f0-57da-4ed8-97ac-83f20dd40d06  osd.0  *
2019-07-24_01:03:24.116695Z_c9a0e53d-282f-45f4-9ef8-e0fa516769fb  osd.0  *
2019-07-24_01:03:26.272656Z_3234791d-10d3-4c4e-91f9-8d0a22edc9e9  osd.0  *
2019-07-24_01:03:27.739259Z_2fb70aa5-6cdd-4d84-a5f6-bfc02b58730f  osd.0  *
2019-07-24_01:06:49.873704Z_61c6dd3f-4d9f-4d48-9beb-7669ebf55b46  osd.0  *
2019-07-24_01:06:51.986488Z_3479bfac-d80b-4659-b725-76ce264016ec  osd.0  *
2019-07-24_01:06:53.462454Z_7115d298-a236-4f29-b086-2f181230bbfc  osd.0  *
2019-07-24_01:14:53.936563Z_cd9adbd4-ca35-4b86-9c31-82129dbc7107  osd.0  *
2019-07-24_01:14:55.990911Z_54ac3f15-8f1a-4b05-b45a-5b57b64e2770  osd.0  *
2019-07-24_01:14:57.478702Z_aaa89ad0-ff89-4a89-b7f1-a37cf142fa7d  osd.0  *
2019-07-24_01:18:03.349167Z_0f48a4b4-6528-444a-b19e-de624db20adb  osd.1  *
2019-07-24_01:18:05.089556Z_cadb27a7-c609-43f2-8a9c-b32a751dabcb  osd.1  *
2019-07-24_01:18:06.531971Z_a3d7f660-0680-4785-bc48-e35a9ca6c8d1  osd.1  *
2019-07-24_01:19:03.301233Z_e9b70074-ac14-42d0-90a6-39fade50a18e  osd.1  *
2019-07-24_01:19:04.990424Z_190b63ae-b00f-4cec-b737-ad823a6d10c8  osd.1  *
2019-07-24_01:19:06.233479Z_4b705469-630f-4bba-805f-322274f2038d  osd.1  *

real    0m6.808s
user    0m0.271s
sys     0m0.028s
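For reference, the module toggling used to unblock those commands was just the standard enable/disable pair (iostat chosen arbitrarily; any module appears to have the same effect):

[root@ceph ~]# ceph mgr module enable iostat
[root@ceph ~]# ceph mgr module disable iostat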
The dashboard, which had been working fine in v14.2.4, now loads very slowly and is sluggish to the point of being unusable.
History
#1 Updated by Alex Walender almost 4 years ago
I can confirm the high CPU usage on our cluster after upgrading to 14.2.5.
MGR was unresponsive and crashing all the time; I got it somewhat stable after disabling the zabbix and dashboard plugins.
Still, MGR takes 100%-200% CPU.
The mgr log does not show anything suspicious. The only weird-looking, repeating entries are:
log [DBG] : from='client.313719929 -' entity='client.admin' cmd=[{"prefix": "crash post", "target": ["mgr", ""]}]: dispatch
Seems that this upgrade has introduced a leak somewhere.
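(If it is queued work piling up inside the mgr, the finisher queue length in its perf counters should show it growing over time; the admin socket path below is an assumption for a default deployment, and counter names may vary by version:)

[root@ceph ~]# ceph daemon /var/run/ceph/ceph-mgr.$(hostname -s).asok perf dump | grep queue_len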
#2 Updated by Neha Ojha almost 4 years ago
- Status changed from New to Duplicate
#3 Updated by Nathan Cutler over 3 years ago
- Duplicates Bug #43364: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive added
#4 Updated by Sebastian Wagner over 3 years ago
- Description updated (diff)
#5 Updated by Lenz Grimmer over 3 years ago
- Related to Bug #45439: High CPU utilization for large clusters in ceph-mgr in 14.2.8 added
#6 Updated by Lenz Grimmer over 3 years ago
- Related to Bug #39264: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors added