Bug #43317

high CPU usage, ceph-mgr very slow or unresponsive following upgrade from Nautilus v14.2.4 to v14.2.5

Added by Dan Moraru 4 months ago. Updated 3 months ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: ceph-mgr
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: Yes
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Since upgrading to Nautilus v14.2.5, ceph-mgr has been essentially unusable. top shows ceph-mgr consistently using between 1 and 4 cores, even without any optional modules loaded:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1289 ceph      20   0 2565980   1.5g  18156 S 369.1 40.0 126:26.77 ceph-mgr

[root@ceph ~]# ceph mgr module ls | head
{
    "enabled_modules": []
...
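
For anyone else trying to narrow this down, a per-thread breakdown of where ceph-mgr spends its CPU, plus the mgr's own perf counters, may help. These are standard tools rather than output captured from this cluster, and the mgr daemon name is assumed to match the short hostname:

[root@ceph ~]# top -H -b -n 1 -p $(pidof ceph-mgr) | head -20
[root@ceph ~]# ceph daemon mgr.$(hostname -s) perf dump | head

The second command goes through the admin socket, so it only works on the host where the active mgr is running.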

Simple commands appear to hang, complete at random intervals, or fail outright:

[root@ceph ~]# time ceph balancer status
{
    "last_optimize_duration": "",
    "plans": [],
    "mode": "none",
    "active": false,
    "optimize_result": "",
    "last_optimize_started": ""
}

real 0m32.232s
user 0m0.284s
sys 0m0.030s

[root@ceph ~]# time ceph crash ls-new
ID ENTITY NEW
2019-07-24_00:55:04.694908Z_bc80ad27-e293-45c3-9bdb-5fd4982b8edc osd.0 *
2019-07-24_00:55:06.003526Z_f0027b87-1654-491b-9b2d-a168322d44ee osd.1 *
2019-07-24_00:55:06.735785Z_2075c453-fbd2-4f2c-8211-dafafc73c6b8 osd.0 *
2019-07-24_00:55:07.460188Z_a02dec84-be70-4665-b2c5-323e067af7b9 osd.1 *
2019-07-24_00:55:08.206238Z_477ce6da-1d59-47da-af95-e6126183a9a3 osd.0 *
2019-07-24_00:58:06.215514Z_2bebd851-fb7b-423d-b326-e973d425c13b osd.1 *
2019-07-24_00:58:08.246643Z_70511d52-c029-4211-80b2-01e312bef787 osd.1 *
2019-07-24_00:58:09.734660Z_01cc78e2-235c-41b7-9b69-5e04b3804146 osd.1 *
2019-07-24_01:01:41.166534Z_f15d6aae-81ec-4bc1-9923-b05735b69d78 osd.0 *
2019-07-24_01:01:43.180687Z_3fd3ae07-65ed-4794-9fe7-1f9fb388dd2c osd.0 *
2019-07-24_01:01:44.791007Z_c0ab06f0-57da-4ed8-97ac-83f20dd40d06 osd.0 *
2019-07-24_01:03:24.116695Z_c9a0e53d-282f-45f4-9ef8-e0fa516769fb osd.0 *
2019-07-24_01:03:26.272656Z_3234791d-10d3-4c4e-91f9-8d0a22edc9e9 osd.0 *
2019-07-24_01:03:27.739259Z_2fb70aa5-6cdd-4d84-a5f6-bfc02b58730f osd.0 *
2019-07-24_01:06:49.873704Z_61c6dd3f-4d9f-4d48-9beb-7669ebf55b46 osd.0 *
2019-07-24_01:06:51.986488Z_3479bfac-d80b-4659-b725-76ce264016ec osd.0 *
2019-07-24_01:06:53.462454Z_7115d298-a236-4f29-b086-2f181230bbfc osd.0 *
2019-07-24_01:14:53.936563Z_cd9adbd4-ca35-4b86-9c31-82129dbc7107 osd.0 *
2019-07-24_01:14:55.990911Z_54ac3f15-8f1a-4b05-b45a-5b57b64e2770 osd.0 *
2019-07-24_01:14:57.478702Z_aaa89ad0-ff89-4a89-b7f1-a37cf142fa7d osd.0 *
2019-07-24_01:18:03.349167Z_0f48a4b4-6528-444a-b19e-de624db20adb osd.1 *
2019-07-24_01:18:05.089556Z_cadb27a7-c609-43f2-8a9c-b32a751dabcb osd.1 *
2019-07-24_01:18:06.531971Z_a3d7f660-0680-4785-bc48-e35a9ca6c8d1 osd.1 *
2019-07-24_01:19:03.301233Z_e9b70074-ac14-42d0-90a6-39fade50a18e osd.1 *
2019-07-24_01:19:04.990424Z_190b63ae-b00f-4cec-b737-ad823a6d10c8 osd.1 *
2019-07-24_01:19:06.233479Z_4b705469-630f-4bba-805f-322274f2038d osd.1 *

real 1m18.191s
user 0m0.295s
sys 0m0.042s

[root@ceph ~]# time ceph osd status | head
Error EINVAL: Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 914, in handle_command
return self.handle_command(inbuf, cmd)
File "/usr/share/ceph/mgr/status/module.py", line 253, in handle_command
return self.handle_osd_status(cmd)
File "/usr/share/ceph/mgr/status/module.py", line 237, in handle_osd_status
mgr_util.format_dimless(self.get_rate("osd", osd_id.
_str__(), "osd.op_w") +
File "/usr/share/ceph/mgr/status/module.py", line 47, in get_rate
return (data[-1][1] - data[-2][1]) / float(data[-1][0] - data[-2][0])
ZeroDivisionError: float division by zero

real 3m8.023s
user 0m0.348s
sys 0m0.068s
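
The ZeroDivisionError above comes from get_rate() in /usr/share/ceph/mgr/status/module.py dividing the change in a perf counter by the time between the two most recent samples; when the mgr is stalled, both samples can carry the same timestamp. A minimal standalone sketch of the kind of guard that avoids the crash (illustrative only, not the upstream fix):

def safe_rate(data):
    """Rate of change between the two most recent (timestamp, value) samples.

    Returns 0.0 instead of raising ZeroDivisionError when fewer than two
    samples exist or when both samples share the same timestamp.
    """
    if not data or len(data) < 2:
        return 0.0
    (t_prev, v_prev), (t_last, v_last) = data[-2], data[-1]
    if t_last == t_prev:
        return 0.0
    return (v_last - v_prev) / float(t_last - t_prev)


# Two samples with identical timestamps no longer raise:
print(safe_rate([(100.0, 5.0), (100.0, 9.0)]))  # 0.0
print(safe_rate([(100.0, 5.0), (102.0, 9.0)]))  # 2.0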

One workaround to get ceph-mgr to respond appears to be enabling or disabling a module (any module). For example, the "ceph osd status" command that previously timed out succeeded when the iostat module was enabled while the command was pending:

[root@ceph ~]# time ceph osd status | head
+----+------+------+-------+--------+---------+--------+---------+-----------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+------+------+-------+--------+---------+--------+---------+-----------+
| 0  |      |    0 |     0 |      0 |       0 |      0 |       0 | exists,up |
| 1  |      |    0 |     0 |      0 |       0 |      0 |       0 | exists,up |
| 2  |      |    0 |     0 |      0 |       0 |      0 |       0 | exists,up |
| 3  |      |    0 |     0 |      0 |       0 |      0 |       0 | exists,up |
| 4  |      |    0 |     0 |      0 |       0 |      0 |       0 | exists,up |
| 5  |      |    0 |     0 |      0 |       0 |      0 |       0 | exists,up |
| 6  |      |    0 |     0 |      0 |       0 |      0 |       0 | exists,up |

real 0m11.827s
user 0m0.280s
sys 0m0.035s
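
For reference, the module toggles used here are just the standard enable/disable commands (iostat happens to be the module used in this run; any module appears to have the same effect):

[root@ceph ~]# ceph mgr module enable iostat
[root@ceph ~]# ceph mgr module disable iostat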

The "ceph crash ls-new" command which previously took over a minute returned more or less promptly as the iostat module was disabled:

[root@ceph ~]# time ceph crash ls-new
ID ENTITY NEW
2019-07-24_00:55:04.694908Z_bc80ad27-e293-45c3-9bdb-5fd4982b8edc osd.0 *
2019-07-24_00:55:06.003526Z_f0027b87-1654-491b-9b2d-a168322d44ee osd.1 *
2019-07-24_00:55:06.735785Z_2075c453-fbd2-4f2c-8211-dafafc73c6b8 osd.0 *
2019-07-24_00:55:07.460188Z_a02dec84-be70-4665-b2c5-323e067af7b9 osd.1 *
2019-07-24_00:55:08.206238Z_477ce6da-1d59-47da-af95-e6126183a9a3 osd.0 *
2019-07-24_00:58:06.215514Z_2bebd851-fb7b-423d-b326-e973d425c13b osd.1 *
2019-07-24_00:58:08.246643Z_70511d52-c029-4211-80b2-01e312bef787 osd.1 *
2019-07-24_00:58:09.734660Z_01cc78e2-235c-41b7-9b69-5e04b3804146 osd.1 *
2019-07-24_01:01:41.166534Z_f15d6aae-81ec-4bc1-9923-b05735b69d78 osd.0 *
2019-07-24_01:01:43.180687Z_3fd3ae07-65ed-4794-9fe7-1f9fb388dd2c osd.0 *
2019-07-24_01:01:44.791007Z_c0ab06f0-57da-4ed8-97ac-83f20dd40d06 osd.0 *
2019-07-24_01:03:24.116695Z_c9a0e53d-282f-45f4-9ef8-e0fa516769fb osd.0 *
2019-07-24_01:03:26.272656Z_3234791d-10d3-4c4e-91f9-8d0a22edc9e9 osd.0 *
2019-07-24_01:03:27.739259Z_2fb70aa5-6cdd-4d84-a5f6-bfc02b58730f osd.0 *
2019-07-24_01:06:49.873704Z_61c6dd3f-4d9f-4d48-9beb-7669ebf55b46 osd.0 *
2019-07-24_01:06:51.986488Z_3479bfac-d80b-4659-b725-76ce264016ec osd.0 *
2019-07-24_01:06:53.462454Z_7115d298-a236-4f29-b086-2f181230bbfc osd.0 *
2019-07-24_01:14:53.936563Z_cd9adbd4-ca35-4b86-9c31-82129dbc7107 osd.0 *
2019-07-24_01:14:55.990911Z_54ac3f15-8f1a-4b05-b45a-5b57b64e2770 osd.0 *
2019-07-24_01:14:57.478702Z_aaa89ad0-ff89-4a89-b7f1-a37cf142fa7d osd.0 *
2019-07-24_01:18:03.349167Z_0f48a4b4-6528-444a-b19e-de624db20adb osd.1 *
2019-07-24_01:18:05.089556Z_cadb27a7-c609-43f2-8a9c-b32a751dabcb osd.1 *
2019-07-24_01:18:06.531971Z_a3d7f660-0680-4785-bc48-e35a9ca6c8d1 osd.1 *
2019-07-24_01:19:03.301233Z_e9b70074-ac14-42d0-90a6-39fade50a18e osd.1 *
2019-07-24_01:19:04.990424Z_190b63ae-b00f-4cec-b737-ad823a6d10c8 osd.1 *
2019-07-24_01:19:06.233479Z_4b705469-630f-4bba-805f-322274f2038d osd.1 *

real 0m6.808s
user 0m0.271s
sys 0m0.028s

The dashboard, which had been working fine in v14.2.4, now loads very slowly and is sluggish to the point of being unusable.

History

#1 Updated by Alex Walender 4 months ago

I can confirm the high CPU usage on our cluster after upgrading to 14.2.5.

The MGR was unresponsive and crashing all the time; I got it somewhat stable after disabling the zabbix and dashboard plugins.
Still, the MGR takes 100-200% CPU.

The mgr log does not show anything suspicious. The only weird-looking, repeating entries are:

log [DBG] : from='client.313719929 -' entity='client.admin' cmd=[{"prefix": "crash post", "target": ["mgr", ""]}]: dispatch

Seems that this upgrade has introduced a leak somewhere.

#2 Updated by Neha Ojha 3 months ago

  • Status changed from New to Duplicate
