Bug #43317

high CPU usage, ceph-mgr very slow or unresponsive following upgrade from Nautilus v14.2.4 to v14.2.5

Added by Dan Moraru almost 4 years ago. Updated over 3 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Since upgrading to Nautilus v14.2.5, ceph-mgr has been essentially unusable. top shows ceph-mgr consistently using between 1 and 4 cores, even without any optional modules loaded:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                
 1289 ceph      20   0 2565980   1.5g  18156 S 369.1 40.0 126:26.77 ceph-mgr                       

[root@ceph ~]# ceph mgr module ls | head
{
    "enabled_modules": []
...

Simple commands appear to hang, complete at random intervals, or fail outright:

[root@ceph ~]# time ceph balancer status
{
    "last_optimize_duration": "", 
    "plans": [], 
    "mode": "none", 
    "active": false, 
    "optimize_result": "", 
    "last_optimize_started": "" 
}

real    0m32.232s
user    0m0.284s
sys     0m0.030s

[root@ceph ~]# time ceph crash ls-new
ID                                                               ENTITY NEW 
2019-07-24_00:55:04.694908Z_bc80ad27-e293-45c3-9bdb-5fd4982b8edc osd.0   *  
2019-07-24_00:55:06.003526Z_f0027b87-1654-491b-9b2d-a168322d44ee osd.1   *  
2019-07-24_00:55:06.735785Z_2075c453-fbd2-4f2c-8211-dafafc73c6b8 osd.0   *  
2019-07-24_00:55:07.460188Z_a02dec84-be70-4665-b2c5-323e067af7b9 osd.1   *  
2019-07-24_00:55:08.206238Z_477ce6da-1d59-47da-af95-e6126183a9a3 osd.0   *  
2019-07-24_00:58:06.215514Z_2bebd851-fb7b-423d-b326-e973d425c13b osd.1   *  
2019-07-24_00:58:08.246643Z_70511d52-c029-4211-80b2-01e312bef787 osd.1   *  
2019-07-24_00:58:09.734660Z_01cc78e2-235c-41b7-9b69-5e04b3804146 osd.1   *  
2019-07-24_01:01:41.166534Z_f15d6aae-81ec-4bc1-9923-b05735b69d78 osd.0   *  
2019-07-24_01:01:43.180687Z_3fd3ae07-65ed-4794-9fe7-1f9fb388dd2c osd.0   *  
2019-07-24_01:01:44.791007Z_c0ab06f0-57da-4ed8-97ac-83f20dd40d06 osd.0   *  
2019-07-24_01:03:24.116695Z_c9a0e53d-282f-45f4-9ef8-e0fa516769fb osd.0   *  
2019-07-24_01:03:26.272656Z_3234791d-10d3-4c4e-91f9-8d0a22edc9e9 osd.0   *  
2019-07-24_01:03:27.739259Z_2fb70aa5-6cdd-4d84-a5f6-bfc02b58730f osd.0   *  
2019-07-24_01:06:49.873704Z_61c6dd3f-4d9f-4d48-9beb-7669ebf55b46 osd.0   *  
2019-07-24_01:06:51.986488Z_3479bfac-d80b-4659-b725-76ce264016ec osd.0   *  
2019-07-24_01:06:53.462454Z_7115d298-a236-4f29-b086-2f181230bbfc osd.0   *  
2019-07-24_01:14:53.936563Z_cd9adbd4-ca35-4b86-9c31-82129dbc7107 osd.0   *  
2019-07-24_01:14:55.990911Z_54ac3f15-8f1a-4b05-b45a-5b57b64e2770 osd.0   *  
2019-07-24_01:14:57.478702Z_aaa89ad0-ff89-4a89-b7f1-a37cf142fa7d osd.0   *  
2019-07-24_01:18:03.349167Z_0f48a4b4-6528-444a-b19e-de624db20adb osd.1   *  
2019-07-24_01:18:05.089556Z_cadb27a7-c609-43f2-8a9c-b32a751dabcb osd.1   *  
2019-07-24_01:18:06.531971Z_a3d7f660-0680-4785-bc48-e35a9ca6c8d1 osd.1   *  
2019-07-24_01:19:03.301233Z_e9b70074-ac14-42d0-90a6-39fade50a18e osd.1   *  
2019-07-24_01:19:04.990424Z_190b63ae-b00f-4cec-b737-ad823a6d10c8 osd.1   *  
2019-07-24_01:19:06.233479Z_4b705469-630f-4bba-805f-322274f2038d osd.1   *  

real    1m18.191s
user    0m0.295s
sys     0m0.042s

[root@ceph ~]# time ceph osd status | head
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 914, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/status/module.py", line 253, in handle_command
    return self.handle_osd_status(cmd)
  File "/usr/share/ceph/mgr/status/module.py", line 237, in handle_osd_status
    mgr_util.format_dimless(self.get_rate("osd", osd_id.__str__(), "osd.op_w") +
  File "/usr/share/ceph/mgr/status/module.py", line 47, in get_rate
    return (data[-1][1] - data[-2][1]) / float(data[-1][0] - data[-2][0])
ZeroDivisionError: float division by zero

real    3m8.023s
user    0m0.348s
sys     0m0.068s
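
The ZeroDivisionError above comes from the status module's get_rate() dividing by the difference between the timestamps of the two most recent perf-counter samples; when two samples carry the same timestamp, the divisor is zero. A minimal defensive sketch of that helper (hypothetical, not the code that ships in 14.2.5) might look like:

    # Hypothetical guarded variant of the status module's get_rate(),
    # sketched from the traceback above. MgrModule.get_counter() returns
    # {path: [(timestamp, value), ...]}.
    def get_rate(self, svc_type, svc_name, path):
        data = self.get_counter(svc_type, svc_name, path)[path]
        # A rate needs at least two samples with distinct timestamps.
        if len(data) < 2 or data[-1][0] == data[-2][0]:
            return 0.0
        return (data[-1][1] - data[-2][1]) / float(data[-1][0] - data[-2][0])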

One workaround that gets ceph-mgr to respond appears to be enabling or disabling a module -- any module. For example, the "ceph osd status" command that previously failed after more than three minutes succeeded when the iostat module was enabled while the command was pending:

[root@ceph ~]# time ceph osd status | head
+-----+------+-------+-------+--------+---------+--------+---------+-----------+
|  id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+-----+------+-------+-------+--------+---------+--------+---------+-----------+
|  0  |      |    0  |    0  |    0   |     0   |    0   |     0   | exists,up |
|  1  |      |    0  |    0  |    0   |     0   |    0   |     0   | exists,up |
|  2  |      |    0  |    0  |    0   |     0   |    0   |     0   | exists,up |
|  3  |      |    0  |    0  |    0   |     0   |    0   |     0   | exists,up |
|  4  |      |    0  |    0  |    0   |     0   |    0   |     0   | exists,up |
|  5  |      |    0  |    0  |    0   |     0   |    0   |     0   | exists,up |
|  6  |      |    0  |    0  |    0   |     0   |    0   |     0   | exists,up |

real    0m11.827s
user    0m0.280s
sys     0m0.035s

The "ceph crash ls-new" command which previously took over a minute returned more or less promptly as the iostat module was disabled:

[root@ceph ~]# time ceph crash ls-new
ID                                                               ENTITY NEW 
2019-07-24_00:55:04.694908Z_bc80ad27-e293-45c3-9bdb-5fd4982b8edc osd.0   *  
2019-07-24_00:55:06.003526Z_f0027b87-1654-491b-9b2d-a168322d44ee osd.1   *  
2019-07-24_00:55:06.735785Z_2075c453-fbd2-4f2c-8211-dafafc73c6b8 osd.0   *  
2019-07-24_00:55:07.460188Z_a02dec84-be70-4665-b2c5-323e067af7b9 osd.1   *  
2019-07-24_00:55:08.206238Z_477ce6da-1d59-47da-af95-e6126183a9a3 osd.0   *  
2019-07-24_00:58:06.215514Z_2bebd851-fb7b-423d-b326-e973d425c13b osd.1   *  
2019-07-24_00:58:08.246643Z_70511d52-c029-4211-80b2-01e312bef787 osd.1   *  
2019-07-24_00:58:09.734660Z_01cc78e2-235c-41b7-9b69-5e04b3804146 osd.1   *  
2019-07-24_01:01:41.166534Z_f15d6aae-81ec-4bc1-9923-b05735b69d78 osd.0   *  
2019-07-24_01:01:43.180687Z_3fd3ae07-65ed-4794-9fe7-1f9fb388dd2c osd.0   *  
2019-07-24_01:01:44.791007Z_c0ab06f0-57da-4ed8-97ac-83f20dd40d06 osd.0   *  
2019-07-24_01:03:24.116695Z_c9a0e53d-282f-45f4-9ef8-e0fa516769fb osd.0   *  
2019-07-24_01:03:26.272656Z_3234791d-10d3-4c4e-91f9-8d0a22edc9e9 osd.0   *  
2019-07-24_01:03:27.739259Z_2fb70aa5-6cdd-4d84-a5f6-bfc02b58730f osd.0   *  
2019-07-24_01:06:49.873704Z_61c6dd3f-4d9f-4d48-9beb-7669ebf55b46 osd.0   *  
2019-07-24_01:06:51.986488Z_3479bfac-d80b-4659-b725-76ce264016ec osd.0   *  
2019-07-24_01:06:53.462454Z_7115d298-a236-4f29-b086-2f181230bbfc osd.0   *  
2019-07-24_01:14:53.936563Z_cd9adbd4-ca35-4b86-9c31-82129dbc7107 osd.0   *  
2019-07-24_01:14:55.990911Z_54ac3f15-8f1a-4b05-b45a-5b57b64e2770 osd.0   *  
2019-07-24_01:14:57.478702Z_aaa89ad0-ff89-4a89-b7f1-a37cf142fa7d osd.0   *  
2019-07-24_01:18:03.349167Z_0f48a4b4-6528-444a-b19e-de624db20adb osd.1   *  
2019-07-24_01:18:05.089556Z_cadb27a7-c609-43f2-8a9c-b32a751dabcb osd.1   *  
2019-07-24_01:18:06.531971Z_a3d7f660-0680-4785-bc48-e35a9ca6c8d1 osd.1   *  
2019-07-24_01:19:03.301233Z_e9b70074-ac14-42d0-90a6-39fade50a18e osd.1   *  
2019-07-24_01:19:04.990424Z_190b63ae-b00f-4cec-b737-ad823a6d10c8 osd.1   *  
2019-07-24_01:19:06.233479Z_4b705469-630f-4bba-805f-322274f2038d osd.1   *  

real    0m6.808s
user    0m0.271s
sys     0m0.028s
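
For reference, the module toggling described above uses the standard enable/disable commands; with iostat as the (arbitrarily chosen) example:

ceph mgr module enable iostat
ceph mgr module disable iostat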

The dashboard, which had been working fine in v14.2.4, now loads very slowly and is sluggish to the point of being unusable.


Related issues

Related to mgr - Bug #45439: High CPU utilization for large clusters in ceph-mgr in 14.2.8 New
Related to mgr - Bug #39264: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors Resolved
Duplicates mgr - Bug #43364: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive Resolved

History

#1 Updated by Alex Walender almost 4 years ago

I can confirm the high CPU usage on our cluster after upgrading to 14.2.5.

The MGR was unresponsive and crashing all the time; I got it somewhat stable after disabling the zabbix and dashboard plugins.
Still, the MGR takes 100% - 200% CPU.

The mgr log does not show anything suspicious. The only "weird"-looking, repeating entries are of the form:

log [DBG] : from='client.313719929 -' entity='client.admin' cmd=[{"prefix": "crash post", "target": ["mgr", ""]}]: dispatch

It seems that this upgrade has introduced a leak somewhere.

#2 Updated by Neha Ojha almost 4 years ago

  • Status changed from New to Duplicate

#3 Updated by Nathan Cutler over 3 years ago

  • Duplicates Bug #43364: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive added

#4 Updated by Sebastian Wagner over 3 years ago

  • Description updated (diff)

#5 Updated by Lenz Grimmer over 3 years ago

  • Related to Bug #45439: High CPU utilization for large clusters in ceph-mgr in 14.2.8 added

#6 Updated by Lenz Grimmer over 3 years ago

  • Related to Bug #39264: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors added
