Bug #43317

Updated by Sebastian Wagner almost 4 years ago

Since upgrading to Nautilus v14.2.5, ceph-mgr has been essentially unusable. top shows ceph-mgr consistently using between 1 and 4 cores, even without any optional modules loaded:

 <pre> 
   

   PID USER        PR    NI      VIRT      RES      SHR S    %CPU %MEM       TIME+ COMMAND                                 
  1289 ceph        20     0 2565980     1.5g    18156 S 369.1 40.0 126:26.77 ceph-mgr                        

 [root@ceph ~]# ceph mgr module ls | head 
 { 
     "enabled_modules": [] 
 ... 
 </pre> 
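For what it's worth, a per-thread view should show which ceph-mgr thread is burning the CPU. Something along these lines (the pidof lookup is an assumption about this host, output not captured here):

<pre>
[root@ceph ~]# top -b -H -n 1 -p $(pidof ceph-mgr)
</pre>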

 Simple commands appear to hang, complete at random intervals, or fail outright: 

 <pre> 
 [root@ceph ~]# time ceph balancer status 
 { 
     "last_optimize_duration": "",  
     "plans": [],  
     "mode": "none",  
     "active": false,  
     "optimize_result": "",  
     "last_optimize_started": "" 
 } 

 real      0m32.232s 
 user      0m0.284s 
 sys       0m0.030s 

 [root@ceph ~]# time ceph crash ls-new 
 ID                                                                 ENTITY NEW  
 2019-07-24_00:55:04.694908Z_bc80ad27-e293-45c3-9bdb-5fd4982b8edc osd.0     *   
 2019-07-24_00:55:06.003526Z_f0027b87-1654-491b-9b2d-a168322d44ee osd.1     *   
 2019-07-24_00:55:06.735785Z_2075c453-fbd2-4f2c-8211-dafafc73c6b8 osd.0     *   
 2019-07-24_00:55:07.460188Z_a02dec84-be70-4665-b2c5-323e067af7b9 osd.1     *   
 2019-07-24_00:55:08.206238Z_477ce6da-1d59-47da-af95-e6126183a9a3 osd.0     *   
 2019-07-24_00:58:06.215514Z_2bebd851-fb7b-423d-b326-e973d425c13b osd.1     *   
 2019-07-24_00:58:08.246643Z_70511d52-c029-4211-80b2-01e312bef787 osd.1     *   
 2019-07-24_00:58:09.734660Z_01cc78e2-235c-41b7-9b69-5e04b3804146 osd.1     *   
 2019-07-24_01:01:41.166534Z_f15d6aae-81ec-4bc1-9923-b05735b69d78 osd.0     *   
 2019-07-24_01:01:43.180687Z_3fd3ae07-65ed-4794-9fe7-1f9fb388dd2c osd.0     *   
 2019-07-24_01:01:44.791007Z_c0ab06f0-57da-4ed8-97ac-83f20dd40d06 osd.0     *   
 2019-07-24_01:03:24.116695Z_c9a0e53d-282f-45f4-9ef8-e0fa516769fb osd.0     *   
 2019-07-24_01:03:26.272656Z_3234791d-10d3-4c4e-91f9-8d0a22edc9e9 osd.0     *   
 2019-07-24_01:03:27.739259Z_2fb70aa5-6cdd-4d84-a5f6-bfc02b58730f osd.0     *   
 2019-07-24_01:06:49.873704Z_61c6dd3f-4d9f-4d48-9beb-7669ebf55b46 osd.0     *   
 2019-07-24_01:06:51.986488Z_3479bfac-d80b-4659-b725-76ce264016ec osd.0     *   
 2019-07-24_01:06:53.462454Z_7115d298-a236-4f29-b086-2f181230bbfc osd.0     *   
 2019-07-24_01:14:53.936563Z_cd9adbd4-ca35-4b86-9c31-82129dbc7107 osd.0     *   
 2019-07-24_01:14:55.990911Z_54ac3f15-8f1a-4b05-b45a-5b57b64e2770 osd.0     *   
 2019-07-24_01:14:57.478702Z_aaa89ad0-ff89-4a89-b7f1-a37cf142fa7d osd.0     *   
 2019-07-24_01:18:03.349167Z_0f48a4b4-6528-444a-b19e-de624db20adb osd.1     *   
 2019-07-24_01:18:05.089556Z_cadb27a7-c609-43f2-8a9c-b32a751dabcb osd.1     *   
 2019-07-24_01:18:06.531971Z_a3d7f660-0680-4785-bc48-e35a9ca6c8d1 osd.1     *   
 2019-07-24_01:19:03.301233Z_e9b70074-ac14-42d0-90a6-39fade50a18e osd.1     *   
 2019-07-24_01:19:04.990424Z_190b63ae-b00f-4cec-b737-ad823a6d10c8 osd.1     *   
 2019-07-24_01:19:06.233479Z_4b705469-630f-4bba-805f-322274f2038d osd.1     *   

 real      1m18.191s 
 user      0m0.295s 
 sys       0m0.042s 

 [root@ceph ~]# time ceph osd status | head 
 Error EINVAL: Traceback (most recent call last): 
   File "/usr/share/ceph/mgr/mgr_module.py", line 914, in _handle_command 
     return self.handle_command(inbuf, cmd) 
   File "/usr/share/ceph/mgr/status/module.py", line 253, in handle_command 
     return self.handle_osd_status(cmd) 
   File "/usr/share/ceph/mgr/status/module.py", line 237, in handle_osd_status 
     mgr_util.format_dimless(self.get_rate("osd", osd_id.__str__(), "osd.op_w") + 
   File "/usr/share/ceph/mgr/status/module.py", line 47, in get_rate 
     return (data[-1][1] - data[-2][1]) / float(data[-1][0] - data[-2][0]) 
 ZeroDivisionError: float division by zero 


 real      3m8.023s 
 user      0m0.348s 
 sys       0m0.068s 
 </pre> 
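The "ceph osd status" failure itself looks secondary: get_rate() in the status module derives a rate from the last two perf-counter samples and divides by the difference of their timestamps, so it raises ZeroDivisionError whenever the two newest samples share a timestamp, which seems plausible while the mgr is wedged and not collecting fresh counters. A guarded variant would look roughly like the sketch below; the get_counter() call and the exact signature are inferred from the traceback and the standard MgrModule API, not verified against this build.

<pre>
def get_rate(self, daemon_type, daemon_name, stat):
    # Sketch of a defensive get_rate(); signature and get_counter() usage are
    # inferred from the traceback above, not taken from the shipped module.
    # get_counter() returns {stat: [(timestamp, value), ...]} for the daemon.
    data = self.get_counter(daemon_type, daemon_name, stat)[stat]

    # Need at least two samples to form a rate.
    if not data or len(data) < 2:
        return 0.0

    dt = data[-1][0] - data[-2][0]
    # While the mgr is stalled, the two newest samples can carry the same
    # timestamp, which is exactly the float division by zero seen above.
    if dt == 0:
        return 0.0

    return (data[-1][1] - data[-2][1]) / float(dt)
</pre>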


One workaround to get ceph-mgr to respond appears to be to enable or disable a module -- any module. For example, the "ceph osd status" command that previously timed out succeeded if the iostat module was enabled while the command was pending:

 <pre> 
 [root@ceph ~]# time ceph osd status | head 
 +-----+------+-------+-------+--------+---------+--------+---------+-----------+ 
 |    id | host |    used | avail | wr ops | wr data | rd ops | rd data |     state     | 
 +-----+------+-------+-------+--------+---------+--------+---------+-----------+ 
 |    0    |        |      0    |      0    |      0     |       0     |      0     |       0     | exists,up | 
 |    1    |        |      0    |      0    |      0     |       0     |      0     |       0     | exists,up | 
 |    2    |        |      0    |      0    |      0     |       0     |      0     |       0     | exists,up | 
 |    3    |        |      0    |      0    |      0     |       0     |      0     |       0     | exists,up | 
 |    4    |        |      0    |      0    |      0     |       0     |      0     |       0     | exists,up | 
 |    5    |        |      0    |      0    |      0     |       0     |      0     |       0     | exists,up | 
 |    6    |        |      0    |      0    |      0     |       0     |      0     |       0     | exists,up | 

 real      0m11.827s 
 user      0m0.280s 
 sys       0m0.035s 
 </pre> 
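(The exact toggle isn't shown above; it was presumably of the form below, issued from a second shell while the hung command was pending. iostat is just the module used here; any module reportedly works.)

<pre>
[root@ceph ~]# ceph mgr module enable iostat
[root@ceph ~]# ceph mgr module disable iostat
</pre>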

 The "ceph crash ls-new" command which previously took over a minute returned more or less promptly as the iostat module was disabled: 

 <pre> 
 [root@ceph ~]# time ceph crash ls-new 
 ID                                                                 ENTITY NEW  
 2019-07-24_00:55:04.694908Z_bc80ad27-e293-45c3-9bdb-5fd4982b8edc osd.0     *   
 2019-07-24_00:55:06.003526Z_f0027b87-1654-491b-9b2d-a168322d44ee osd.1     *   
 2019-07-24_00:55:06.735785Z_2075c453-fbd2-4f2c-8211-dafafc73c6b8 osd.0     *   
 2019-07-24_00:55:07.460188Z_a02dec84-be70-4665-b2c5-323e067af7b9 osd.1     *   
 2019-07-24_00:55:08.206238Z_477ce6da-1d59-47da-af95-e6126183a9a3 osd.0     *   
 2019-07-24_00:58:06.215514Z_2bebd851-fb7b-423d-b326-e973d425c13b osd.1     *   
 2019-07-24_00:58:08.246643Z_70511d52-c029-4211-80b2-01e312bef787 osd.1     *   
 2019-07-24_00:58:09.734660Z_01cc78e2-235c-41b7-9b69-5e04b3804146 osd.1     *   
 2019-07-24_01:01:41.166534Z_f15d6aae-81ec-4bc1-9923-b05735b69d78 osd.0     *   
 2019-07-24_01:01:43.180687Z_3fd3ae07-65ed-4794-9fe7-1f9fb388dd2c osd.0     *   
 2019-07-24_01:01:44.791007Z_c0ab06f0-57da-4ed8-97ac-83f20dd40d06 osd.0     *   
 2019-07-24_01:03:24.116695Z_c9a0e53d-282f-45f4-9ef8-e0fa516769fb osd.0     *   
 2019-07-24_01:03:26.272656Z_3234791d-10d3-4c4e-91f9-8d0a22edc9e9 osd.0     *   
 2019-07-24_01:03:27.739259Z_2fb70aa5-6cdd-4d84-a5f6-bfc02b58730f osd.0     *   
 2019-07-24_01:06:49.873704Z_61c6dd3f-4d9f-4d48-9beb-7669ebf55b46 osd.0     *   
 2019-07-24_01:06:51.986488Z_3479bfac-d80b-4659-b725-76ce264016ec osd.0     *   
 2019-07-24_01:06:53.462454Z_7115d298-a236-4f29-b086-2f181230bbfc osd.0     *   
 2019-07-24_01:14:53.936563Z_cd9adbd4-ca35-4b86-9c31-82129dbc7107 osd.0     *   
 2019-07-24_01:14:55.990911Z_54ac3f15-8f1a-4b05-b45a-5b57b64e2770 osd.0     *   
 2019-07-24_01:14:57.478702Z_aaa89ad0-ff89-4a89-b7f1-a37cf142fa7d osd.0     *   
 2019-07-24_01:18:03.349167Z_0f48a4b4-6528-444a-b19e-de624db20adb osd.1     *   
 2019-07-24_01:18:05.089556Z_cadb27a7-c609-43f2-8a9c-b32a751dabcb osd.1     *   
 2019-07-24_01:18:06.531971Z_a3d7f660-0680-4785-bc48-e35a9ca6c8d1 osd.1     *   
 2019-07-24_01:19:03.301233Z_e9b70074-ac14-42d0-90a6-39fade50a18e osd.1     *   
 2019-07-24_01:19:04.990424Z_190b63ae-b00f-4cec-b737-ad823a6d10c8 osd.1     *   
 2019-07-24_01:19:06.233479Z_4b705469-630f-4bba-805f-322274f2038d osd.1     *   

 real      0m6.808s 
 user      0m0.271s 
 sys       0m0.028s 
 </pre> 

 The dashboard, which had been working fine in v14.2.4, now loads very slowly and is sluggish to the point of being unusable.
