Bug #43317
Updated by Sebastian Wagner almost 4 years ago
Since upgrading to Nautilus v14.2.5, ceph-mgr has been essentially unusable. top shows ceph-mgr consistently using between 1 and 4 cores, even without any optional modules loaded:

<pre>
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1289 ceph      20   0 2565980   1.5g  18156 S 369.1 40.0 126:26.77 ceph-mgr

[root@ceph ~]# ceph mgr module ls | head
{
    "enabled_modules": []
...
</pre>

Simple commands appear to hang, complete at random intervals, or fail outright:

<pre>
[root@ceph ~]# time ceph balancer status
{
    "last_optimize_duration": "",
    "plans": [],
    "mode": "none",
    "active": false,
    "optimize_result": "",
    "last_optimize_started": ""
}

real    0m32.232s
user    0m0.284s
sys     0m0.030s

[root@ceph ~]# time ceph crash ls-new
ID                                                               ENTITY NEW
2019-07-24_00:55:04.694908Z_bc80ad27-e293-45c3-9bdb-5fd4982b8edc osd.0  *
2019-07-24_00:55:06.003526Z_f0027b87-1654-491b-9b2d-a168322d44ee osd.1  *
2019-07-24_00:55:06.735785Z_2075c453-fbd2-4f2c-8211-dafafc73c6b8 osd.0  *
2019-07-24_00:55:07.460188Z_a02dec84-be70-4665-b2c5-323e067af7b9 osd.1  *
2019-07-24_00:55:08.206238Z_477ce6da-1d59-47da-af95-e6126183a9a3 osd.0  *
2019-07-24_00:58:06.215514Z_2bebd851-fb7b-423d-b326-e973d425c13b osd.1  *
2019-07-24_00:58:08.246643Z_70511d52-c029-4211-80b2-01e312bef787 osd.1  *
2019-07-24_00:58:09.734660Z_01cc78e2-235c-41b7-9b69-5e04b3804146 osd.1  *
2019-07-24_01:01:41.166534Z_f15d6aae-81ec-4bc1-9923-b05735b69d78 osd.0  *
2019-07-24_01:01:43.180687Z_3fd3ae07-65ed-4794-9fe7-1f9fb388dd2c osd.0  *
2019-07-24_01:01:44.791007Z_c0ab06f0-57da-4ed8-97ac-83f20dd40d06 osd.0  *
2019-07-24_01:03:24.116695Z_c9a0e53d-282f-45f4-9ef8-e0fa516769fb osd.0  *
2019-07-24_01:03:26.272656Z_3234791d-10d3-4c4e-91f9-8d0a22edc9e9 osd.0  *
2019-07-24_01:03:27.739259Z_2fb70aa5-6cdd-4d84-a5f6-bfc02b58730f osd.0  *
2019-07-24_01:06:49.873704Z_61c6dd3f-4d9f-4d48-9beb-7669ebf55b46 osd.0  *
2019-07-24_01:06:51.986488Z_3479bfac-d80b-4659-b725-76ce264016ec osd.0  *
2019-07-24_01:06:53.462454Z_7115d298-a236-4f29-b086-2f181230bbfc osd.0  *
2019-07-24_01:14:53.936563Z_cd9adbd4-ca35-4b86-9c31-82129dbc7107 osd.0  *
2019-07-24_01:14:55.990911Z_54ac3f15-8f1a-4b05-b45a-5b57b64e2770 osd.0  *
2019-07-24_01:14:57.478702Z_aaa89ad0-ff89-4a89-b7f1-a37cf142fa7d osd.0  *
2019-07-24_01:18:03.349167Z_0f48a4b4-6528-444a-b19e-de624db20adb osd.1  *
2019-07-24_01:18:05.089556Z_cadb27a7-c609-43f2-8a9c-b32a751dabcb osd.1  *
2019-07-24_01:18:06.531971Z_a3d7f660-0680-4785-bc48-e35a9ca6c8d1 osd.1  *
2019-07-24_01:19:03.301233Z_e9b70074-ac14-42d0-90a6-39fade50a18e osd.1  *
2019-07-24_01:19:04.990424Z_190b63ae-b00f-4cec-b737-ad823a6d10c8 osd.1  *
2019-07-24_01:19:06.233479Z_4b705469-630f-4bba-805f-322274f2038d osd.1  *

real    1m18.191s
user    0m0.295s
sys     0m0.042s

[root@ceph ~]# time ceph osd status | head
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 914, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/status/module.py", line 253, in handle_command
    return self.handle_osd_status(cmd)
  File "/usr/share/ceph/mgr/status/module.py", line 237, in handle_osd_status
    mgr_util.format_dimless(self.get_rate("osd", osd_id.__str__(), "osd.op_w") +
  File "/usr/share/ceph/mgr/status/module.py", line 47, in get_rate
    return (data[-1][1] - data[-2][1]) / float(data[-1][0] - data[-2][0])
ZeroDivisionError: float division by zero

real    3m8.023s
user    0m0.348s
sys     0m0.068s
</pre>

One workaround to get ceph-mgr to respond appears to be to enable or disable a module -- any module.
For example, the "ceph osd status" command that previously timed out succeeded when the iostat module was enabled while the command was pending:

<pre>
[root@ceph ~]# time ceph osd status | head
+-----+------+-------+-------+--------+---------+--------+---------+-----------+
|  id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+-----+------+-------+-------+--------+---------+--------+---------+-----------+
|  0  |      |   0   |   0   |   0    |    0    |   0    |    0    | exists,up |
|  1  |      |   0   |   0   |   0    |    0    |   0    |    0    | exists,up |
|  2  |      |   0   |   0   |   0    |    0    |   0    |    0    | exists,up |
|  3  |      |   0   |   0   |   0    |    0    |   0    |    0    | exists,up |
|  4  |      |   0   |   0   |   0    |    0    |   0    |    0    | exists,up |
|  5  |      |   0   |   0   |   0    |    0    |   0    |    0    | exists,up |
|  6  |      |   0   |   0   |   0    |    0    |   0    |    0    | exists,up |

real    0m11.827s
user    0m0.280s
sys     0m0.035s
</pre>

The "ceph crash ls-new" command, which previously took over a minute, returned more or less promptly when the iostat module was disabled:

<pre>
[root@ceph ~]# time ceph crash ls-new
ID                                                               ENTITY NEW
2019-07-24_00:55:04.694908Z_bc80ad27-e293-45c3-9bdb-5fd4982b8edc osd.0  *
2019-07-24_00:55:06.003526Z_f0027b87-1654-491b-9b2d-a168322d44ee osd.1  *
2019-07-24_00:55:06.735785Z_2075c453-fbd2-4f2c-8211-dafafc73c6b8 osd.0  *
2019-07-24_00:55:07.460188Z_a02dec84-be70-4665-b2c5-323e067af7b9 osd.1  *
2019-07-24_00:55:08.206238Z_477ce6da-1d59-47da-af95-e6126183a9a3 osd.0  *
2019-07-24_00:58:06.215514Z_2bebd851-fb7b-423d-b326-e973d425c13b osd.1  *
2019-07-24_00:58:08.246643Z_70511d52-c029-4211-80b2-01e312bef787 osd.1  *
2019-07-24_00:58:09.734660Z_01cc78e2-235c-41b7-9b69-5e04b3804146 osd.1  *
2019-07-24_01:01:41.166534Z_f15d6aae-81ec-4bc1-9923-b05735b69d78 osd.0  *
2019-07-24_01:01:43.180687Z_3fd3ae07-65ed-4794-9fe7-1f9fb388dd2c osd.0  *
2019-07-24_01:01:44.791007Z_c0ab06f0-57da-4ed8-97ac-83f20dd40d06 osd.0  *
2019-07-24_01:03:24.116695Z_c9a0e53d-282f-45f4-9ef8-e0fa516769fb osd.0  *
2019-07-24_01:03:26.272656Z_3234791d-10d3-4c4e-91f9-8d0a22edc9e9 osd.0  *
2019-07-24_01:03:27.739259Z_2fb70aa5-6cdd-4d84-a5f6-bfc02b58730f osd.0  *
2019-07-24_01:06:49.873704Z_61c6dd3f-4d9f-4d48-9beb-7669ebf55b46 osd.0  *
2019-07-24_01:06:51.986488Z_3479bfac-d80b-4659-b725-76ce264016ec osd.0  *
2019-07-24_01:06:53.462454Z_7115d298-a236-4f29-b086-2f181230bbfc osd.0  *
2019-07-24_01:14:53.936563Z_cd9adbd4-ca35-4b86-9c31-82129dbc7107 osd.0  *
2019-07-24_01:14:55.990911Z_54ac3f15-8f1a-4b05-b45a-5b57b64e2770 osd.0  *
2019-07-24_01:14:57.478702Z_aaa89ad0-ff89-4a89-b7f1-a37cf142fa7d osd.0  *
2019-07-24_01:18:03.349167Z_0f48a4b4-6528-444a-b19e-de624db20adb osd.1  *
2019-07-24_01:18:05.089556Z_cadb27a7-c609-43f2-8a9c-b32a751dabcb osd.1  *
2019-07-24_01:18:06.531971Z_a3d7f660-0680-4785-bc48-e35a9ca6c8d1 osd.1  *
2019-07-24_01:19:03.301233Z_e9b70074-ac14-42d0-90a6-39fade50a18e osd.1  *
2019-07-24_01:19:04.990424Z_190b63ae-b00f-4cec-b737-ad823a6d10c8 osd.1  *
2019-07-24_01:19:06.233479Z_4b705469-630f-4bba-805f-322274f2038d osd.1  *

real    0m6.808s
user    0m0.271s
sys     0m0.028s
</pre>

The dashboard, which had been working fine in v14.2.4, now loads very slowly and is sluggish to the point of being unusable.
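As an aside on the "ceph osd status" traceback: line 47 of status/module.py divides by the time delta between the last two perf-counter samples, which is zero when both samples carry the same timestamp. A minimal standalone sketch of a guard for that failure mode, assuming (from the traceback alone) that the samples are (timestamp, value) pairs; this is illustrative, not the upstream fix:

```python
# Sketch of guarding the rate calculation from the traceback.
# Assumption (inferred from the traceback, not from Ceph source): `data`
# is a list of (timestamp, value) pairs, newest last. The crashing line was:
#     (data[-1][1] - data[-2][1]) / float(data[-1][0] - data[-2][0])
# which raises ZeroDivisionError when the last two timestamps are equal.

def get_rate(data):
    """Rate between the last two samples, or 0.0 when it is undefined."""
    if len(data) < 2:
        # Not enough samples to compute a delta.
        return 0.0
    dv = data[-1][1] - data[-2][1]
    dt = float(data[-1][0] - data[-2][0])
    if dt == 0.0:
        # Duplicate or stalled timestamps: report zero rather than crash.
        return 0.0
    return dv / dt

print(get_rate([(10, 100), (20, 300)]))  # 20.0
print(get_rate([(10, 100), (10, 300)]))  # 0.0 instead of ZeroDivisionError
```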