Bug #43317

high CPU usage, ceph-mgr very slow or unresponsive following upgrade from Nautilus v14.2.4 to v14.2.5

Added by Dan Moraru almost 4 years ago. Updated over 3 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Since upgrading to Nautilus v14.2.5, ceph-mgr has been essentially unusable. top shows ceph-mgr consistently using between 1 and 4 cores, even without any optional modules loaded:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                
 1289 ceph      20   0 2565980   1.5g  18156 S 369.1 40.0 126:26.77 ceph-mgr                       

[root@ceph ~]# ceph mgr module ls | head
{
    "enabled_modules": []
...

Simple commands appear to hang, complete at random intervals, or fail outright:

[root@ceph ~]# time ceph balancer status
{
    "last_optimize_duration": "", 
    "plans": [], 
    "mode": "none", 
    "active": false, 
    "optimize_result": "", 
    "last_optimize_started": "" 
}

real    0m32.232s
user    0m0.284s
sys     0m0.030s

[root@ceph ~]# time ceph crash ls-new
ID                                                               ENTITY NEW 
2019-07-24_00:55:04.694908Z_bc80ad27-e293-45c3-9bdb-5fd4982b8edc osd.0   *  
2019-07-24_00:55:06.003526Z_f0027b87-1654-491b-9b2d-a168322d44ee osd.1   *  
2019-07-24_00:55:06.735785Z_2075c453-fbd2-4f2c-8211-dafafc73c6b8 osd.0   *  
2019-07-24_00:55:07.460188Z_a02dec84-be70-4665-b2c5-323e067af7b9 osd.1   *  
2019-07-24_00:55:08.206238Z_477ce6da-1d59-47da-af95-e6126183a9a3 osd.0   *  
2019-07-24_00:58:06.215514Z_2bebd851-fb7b-423d-b326-e973d425c13b osd.1   *  
2019-07-24_00:58:08.246643Z_70511d52-c029-4211-80b2-01e312bef787 osd.1   *  
2019-07-24_00:58:09.734660Z_01cc78e2-235c-41b7-9b69-5e04b3804146 osd.1   *  
2019-07-24_01:01:41.166534Z_f15d6aae-81ec-4bc1-9923-b05735b69d78 osd.0   *  
2019-07-24_01:01:43.180687Z_3fd3ae07-65ed-4794-9fe7-1f9fb388dd2c osd.0   *  
2019-07-24_01:01:44.791007Z_c0ab06f0-57da-4ed8-97ac-83f20dd40d06 osd.0   *  
2019-07-24_01:03:24.116695Z_c9a0e53d-282f-45f4-9ef8-e0fa516769fb osd.0   *  
2019-07-24_01:03:26.272656Z_3234791d-10d3-4c4e-91f9-8d0a22edc9e9 osd.0   *  
2019-07-24_01:03:27.739259Z_2fb70aa5-6cdd-4d84-a5f6-bfc02b58730f osd.0   *  
2019-07-24_01:06:49.873704Z_61c6dd3f-4d9f-4d48-9beb-7669ebf55b46 osd.0   *  
2019-07-24_01:06:51.986488Z_3479bfac-d80b-4659-b725-76ce264016ec osd.0   *  
2019-07-24_01:06:53.462454Z_7115d298-a236-4f29-b086-2f181230bbfc osd.0   *  
2019-07-24_01:14:53.936563Z_cd9adbd4-ca35-4b86-9c31-82129dbc7107 osd.0   *  
2019-07-24_01:14:55.990911Z_54ac3f15-8f1a-4b05-b45a-5b57b64e2770 osd.0   *  
2019-07-24_01:14:57.478702Z_aaa89ad0-ff89-4a89-b7f1-a37cf142fa7d osd.0   *  
2019-07-24_01:18:03.349167Z_0f48a4b4-6528-444a-b19e-de624db20adb osd.1   *  
2019-07-24_01:18:05.089556Z_cadb27a7-c609-43f2-8a9c-b32a751dabcb osd.1   *  
2019-07-24_01:18:06.531971Z_a3d7f660-0680-4785-bc48-e35a9ca6c8d1 osd.1   *  
2019-07-24_01:19:03.301233Z_e9b70074-ac14-42d0-90a6-39fade50a18e osd.1   *  
2019-07-24_01:19:04.990424Z_190b63ae-b00f-4cec-b737-ad823a6d10c8 osd.1   *  
2019-07-24_01:19:06.233479Z_4b705469-630f-4bba-805f-322274f2038d osd.1   *  

real    1m18.191s
user    0m0.295s
sys     0m0.042s

[root@ceph ~]# time ceph osd status | head
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 914, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/status/module.py", line 253, in handle_command
    return self.handle_osd_status(cmd)
  File "/usr/share/ceph/mgr/status/module.py", line 237, in handle_osd_status
    mgr_util.format_dimless(self.get_rate("osd", osd_id.__str__(), "osd.op_w") +
  File "/usr/share/ceph/mgr/status/module.py", line 47, in get_rate
    return (data[-1][1] - data[-2][1]) / float(data[-1][0] - data[-2][0])
ZeroDivisionError: float division by zero

real    3m8.023s
user    0m0.348s
sys     0m0.068s
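
The ZeroDivisionError above comes from the status module's get_rate() dividing by the difference between the timestamps of the two most recent perf-counter samples; when two samples carry the same timestamp, the divisor is zero. A minimal defensive sketch of that helper (hypothetical, not the code that ships in 14.2.5) might look like:

    # Hypothetical guarded variant of the status module's get_rate(),
    # sketched from the traceback above. MgrModule.get_counter() returns
    # {path: [(timestamp, value), ...]}.
    def get_rate(self, svc_type, svc_name, path):
        data = self.get_counter(svc_type, svc_name, path)[path]
        # A rate needs at least two samples with distinct timestamps.
        if len(data) < 2 or data[-1][0] == data[-2][0]:
            return 0.0
        return (data[-1][1] - data[-2][1]) / float(data[-1][0] - data[-2][0])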

One workaround that gets ceph-mgr to respond appears to be enabling or disabling a module -- any module. For example, the "ceph osd status" command that previously failed after more than three minutes succeeded when the iostat module was enabled while the command was pending:

[root@ceph ~]# time ceph osd status | head
+-----+------+-------+-------+--------+---------+--------+---------+-----------+
|  id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+-----+------+-------+-------+--------+---------+--------+---------+-----------+
|  0  |      |    0  |    0  |    0   |     0   |    0   |     0   | exists,up |
|  1  |      |    0  |    0  |    0   |     0   |    0   |     0   | exists,up |
|  2  |      |    0  |    0  |    0   |     0   |    0   |     0   | exists,up |
|  3  |      |    0  |    0  |    0   |     0   |    0   |     0   | exists,up |
|  4  |      |    0  |    0  |    0   |     0   |    0   |     0   | exists,up |
|  5  |      |    0  |    0  |    0   |     0   |    0   |     0   | exists,up |
|  6  |      |    0  |    0  |    0   |     0   |    0   |     0   | exists,up |

real    0m11.827s
user    0m0.280s
sys     0m0.035s

The "ceph crash ls-new" command which previously took over a minute returned more or less promptly as the iostat module was disabled:

[root@ceph ~]# time ceph crash ls-new
ID                                                               ENTITY NEW 
2019-07-24_00:55:04.694908Z_bc80ad27-e293-45c3-9bdb-5fd4982b8edc osd.0   *  
2019-07-24_00:55:06.003526Z_f0027b87-1654-491b-9b2d-a168322d44ee osd.1   *  
2019-07-24_00:55:06.735785Z_2075c453-fbd2-4f2c-8211-dafafc73c6b8 osd.0   *  
2019-07-24_00:55:07.460188Z_a02dec84-be70-4665-b2c5-323e067af7b9 osd.1   *  
2019-07-24_00:55:08.206238Z_477ce6da-1d59-47da-af95-e6126183a9a3 osd.0   *  
2019-07-24_00:58:06.215514Z_2bebd851-fb7b-423d-b326-e973d425c13b osd.1   *  
2019-07-24_00:58:08.246643Z_70511d52-c029-4211-80b2-01e312bef787 osd.1   *  
2019-07-24_00:58:09.734660Z_01cc78e2-235c-41b7-9b69-5e04b3804146 osd.1   *  
2019-07-24_01:01:41.166534Z_f15d6aae-81ec-4bc1-9923-b05735b69d78 osd.0   *  
2019-07-24_01:01:43.180687Z_3fd3ae07-65ed-4794-9fe7-1f9fb388dd2c osd.0   *  
2019-07-24_01:01:44.791007Z_c0ab06f0-57da-4ed8-97ac-83f20dd40d06 osd.0   *  
2019-07-24_01:03:24.116695Z_c9a0e53d-282f-45f4-9ef8-e0fa516769fb osd.0   *  
2019-07-24_01:03:26.272656Z_3234791d-10d3-4c4e-91f9-8d0a22edc9e9 osd.0   *  
2019-07-24_01:03:27.739259Z_2fb70aa5-6cdd-4d84-a5f6-bfc02b58730f osd.0   *  
2019-07-24_01:06:49.873704Z_61c6dd3f-4d9f-4d48-9beb-7669ebf55b46 osd.0   *  
2019-07-24_01:06:51.986488Z_3479bfac-d80b-4659-b725-76ce264016ec osd.0   *  
2019-07-24_01:06:53.462454Z_7115d298-a236-4f29-b086-2f181230bbfc osd.0   *  
2019-07-24_01:14:53.936563Z_cd9adbd4-ca35-4b86-9c31-82129dbc7107 osd.0   *  
2019-07-24_01:14:55.990911Z_54ac3f15-8f1a-4b05-b45a-5b57b64e2770 osd.0   *  
2019-07-24_01:14:57.478702Z_aaa89ad0-ff89-4a89-b7f1-a37cf142fa7d osd.0   *  
2019-07-24_01:18:03.349167Z_0f48a4b4-6528-444a-b19e-de624db20adb osd.1   *  
2019-07-24_01:18:05.089556Z_cadb27a7-c609-43f2-8a9c-b32a751dabcb osd.1   *  
2019-07-24_01:18:06.531971Z_a3d7f660-0680-4785-bc48-e35a9ca6c8d1 osd.1   *  
2019-07-24_01:19:03.301233Z_e9b70074-ac14-42d0-90a6-39fade50a18e osd.1   *  
2019-07-24_01:19:04.990424Z_190b63ae-b00f-4cec-b737-ad823a6d10c8 osd.1   *  
2019-07-24_01:19:06.233479Z_4b705469-630f-4bba-805f-322274f2038d osd.1   *  

real    0m6.808s
user    0m0.271s
sys     0m0.028s
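
For reference, the module toggling described above uses the standard enable/disable commands; with iostat as the (arbitrarily chosen) example:

ceph mgr module enable iostat
ceph mgr module disable iostat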

The dashboard, which had been working fine in v14.2.4, now loads very slowly and is sluggish to the point of being unusable.


Related issues

Related to mgr - Bug #45439: High CPU utilization for large clusters in ceph-mgr in 14.2.8 New
Related to mgr - Bug #39264: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors Resolved
Duplicates mgr - Bug #43364: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive Resolved

History

#1 Updated by Alex Walender almost 4 years ago

I can confirm the high CPU usage on our cluster after upgrading to 14.2.5.

The MGR was unresponsive and crashing all the time; I got it somewhat stable after disabling the zabbix and dashboard plugins.
Still, the MGR takes 100% - 200% CPU.

The mgr log does not show anything suspicious. The only "weird"-looking, repeating entries are of the form:

log [DBG] : from='client.313719929 -' entity='client.admin' cmd=[{"prefix": "crash post", "target": ["mgr", ""]}]: dispatch

It seems that this upgrade has introduced a leak somewhere.

#2 Updated by Neha Ojha almost 4 years ago

  • Status changed from New to Duplicate

#3 Updated by Nathan Cutler over 3 years ago

  • Duplicates Bug #43364: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive added

#4 Updated by Sebastian Wagner over 3 years ago

  • Description updated (diff)

#5 Updated by Lenz Grimmer over 3 years ago

  • Related to Bug #45439: High CPU utilization for large clusters in ceph-mgr in 14.2.8 added

#6 Updated by Lenz Grimmer over 3 years ago

  • Related to Bug #39264: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors added
