Bug #51092
mds: Timed out waiting for MDS daemons to become healthy
Status: Closed
Description
```
2021-06-03T11:59:43.755 INFO:tasks.cephfs_test_runner:test_full_fsync (tasks.cephfs.test_full.TestClusterFull) ... ERROR
2021-06-03T11:59:43.756 INFO:tasks.cephfs_test_runner:
2021-06-03T11:59:43.756 INFO:tasks.cephfs_test_runner:======================================================================
2021-06-03T11:59:43.757 INFO:tasks.cephfs_test_runner:ERROR: test_full_fsync (tasks.cephfs.test_full.TestClusterFull)
2021-06-03T11:59:43.757 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-06-03T11:59:43.757 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-06-03T11:59:43.758 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_986f0ca109bf8b3eec6027f40cf5b33b8b7bbaf4/qa/tasks/cephfs/test_full.py", line 395, in setUp
2021-06-03T11:59:43.758 INFO:tasks.cephfs_test_runner:    super(TestClusterFull, self).setUp()
2021-06-03T11:59:43.758 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_986f0ca109bf8b3eec6027f40cf5b33b8b7bbaf4/qa/tasks/cephfs/test_full.py", line 32, in setUp
2021-06-03T11:59:43.758 INFO:tasks.cephfs_test_runner:    CephFSTestCase.setUp(self)
2021-06-03T11:59:43.759 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_986f0ca109bf8b3eec6027f40cf5b33b8b7bbaf4/qa/tasks/cephfs/cephfs_test_case.py", line 169, in setUp
2021-06-03T11:59:43.759 INFO:tasks.cephfs_test_runner:    self.fs.wait_for_daemons()
2021-06-03T11:59:43.759 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_986f0ca109bf8b3eec6027f40cf5b33b8b7bbaf4/qa/tasks/cephfs/filesystem.py", line 1075, in wait_for_daemons
2021-06-03T11:59:43.760 INFO:tasks.cephfs_test_runner:    raise RuntimeError("Timed out waiting for MDS daemons to become healthy")
2021-06-03T11:59:43.760 INFO:tasks.cephfs_test_runner:RuntimeError: Timed out waiting for MDS daemons to become healthy
2021-06-03T11:59:43.760 INFO:tasks.cephfs_test_runner:
2021-06-03T11:59:43.760 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-06-03T11:59:43.761 INFO:tasks.cephfs_test_runner:Ran 4 tests in 416.483s
2021-06-03T11:59:43.761 INFO:tasks.cephfs_test_runner:
2021-06-03T11:59:43.761 INFO:tasks.cephfs_test_runner:FAILED (errors=1)
2021-06-03T11:59:43.762 INFO:tasks.cephfs_test_runner:
```
Checked the `mds.c` daemon log; it seems that because the OSD is full, `mds.c` stayed stuck in the up:creating state:
```
2021-06-03T11:55:58.894+0000 7f01a0e67700 7 mds.0.server operator(): full = 1 epoch = 92
2021-06-03T11:55:58.894+0000 7f01a0e67700 4 mds.0.39 handle_osd_map epoch 92, 0 new blocklist entries
2021-06-03T11:55:58.894+0000 7f01a0e67700 10 mds.0.server apply_blocklist: killed 0
...
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 mds.c my gid is 6078
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 mds.c map says I am mds.0.39 state up:creating
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 mds.c msgr says I am [v2:172.21.15.25:6834/3032317271,v1:172.21.15.25:6835/3032317271]
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 mds.c handle_mds_map: handling map as rank 0
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 notify_mdsmap: mds.metrics
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 notify_mdsmap: mds.metrics: rank0 is unavailable
```
From the osd.4 log, the daemon was in the failsafe full state:
```
2021-06-03T11:56:10.693+0000 7f814d502700 20 osd.4 93 check_full_status cur ratio 9.22337e+10, physical ratio 9.22337e+10, new state failsafe
```
Updated by Patrick Donnelly almost 3 years ago
Xiubo, suggest you retest with a change to have wait_for_daemons also periodically run "ceph osd dump" and "ceph osd df" for debugging purposes. It looks like this may be a RADOS bug.
Updated by Xiubo Li almost 3 years ago
Patrick Donnelly wrote:
> Xiubo, suggest you retest with a change to have wait_for_daemons also periodically run "ceph osd dump" and "ceph osd df" for debugging purposes. It looks like this may be a RADOS bug.
Sure, will do that.
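For reference, a rough sketch of what that debugging tweak could look like (a hypothetical standalone helper, not the actual `wait_for_daemons()` in qa/tasks/cephfs/filesystem.py; it assumes the `Filesystem` object exposes `are_daemons_healthy()` and `mon_manager.raw_cluster_cmd()` like the other qa helpers):

```python
import time

def wait_for_daemons_with_osd_debug(fs, timeout=300, interval=10):
    """Hypothetical wrapper: wait for the MDS daemons to become healthy while
    periodically dumping OSD state for debugging (per the suggestion above)."""
    elapsed = 0
    while not fs.are_daemons_healthy():
        # Record the OSD map and utilization so a stuck-full cluster is
        # visible in the teuthology log while we wait.
        fs.mon_manager.raw_cluster_cmd("osd", "dump")
        fs.mon_manager.raw_cluster_cmd("osd", "df")
        if elapsed >= timeout:
            raise RuntimeError("Timed out waiting for MDS daemons to become healthy")
        time.sleep(interval)
        elapsed += interval
```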
Updated by Xiubo Li almost 3 years ago
From the osd.4 logs we can see that the available size is 0xc037e7b = 192.2 MB, which is larger than the total size of 0xbebc200 = 190.7 MB:
```
2021-06-03T11:56:10.693+0000 7f814d502700 5 osd.4 93 heartbeat osd_stat(store_statfs(0xc037e7b/0x0/0xbebc200, data 0x0/0x0, compress 0x0/0x0/0x0, omap 0x0, meta 0x0), peers [0,1,2,3,5,6,7] op hist [])
2021-06-03T11:56:10.693+0000 7f814d502700 20 osd.4 93 check_full_status cur ratio 9.22337e+10, physical ratio 9.22337e+10, new state failsafe
2021-06-03T11:56:10.693+0000 7f814d502700 1 -- 172.21.15.171:0/15202 --> [v2:172.21.15.25:6806/14862,v1:172.21.15.25:6807/14862] -- osd_ping(ping e93 up_from 13 ping_stamp 2021-06-03T11:56:10.695874+0000/254.858230591s send_stamp 254.858230591s delta_ub -0.032985143s) v5 -- 0x5616279f6180 con 0x561626fa1000
```
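A quick check of those store_statfs numbers (hypothetical snippet; it just redoes the unit conversion, reading the first field as available bytes and the third as total bytes, as above):

```python
# Hypothetical sanity check of the store_statfs values quoted above.
avail = 0xc037e7b   # first field, read as available bytes
total = 0xbebc200   # third field, read as total bytes

print(f"available = {avail} B = {avail / 2**20:.1f} MB")  # -> 192.2 MB
print(f"total     = {total} B = {total / 2**20:.1f} MB")  # -> 190.7 MB
print("available > total:", avail > total)                # -> True, which should be impossible
```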
There must be something wrong with how the available size is calculated when disk space is released.

Some earlier abnormal log entries:
```
2021-06-03T11:55:42.193+0000 7f815da76700 10 osd.4 79 tick_without_osd_lock
2021-06-03T11:55:42.193+0000 7f815da76700 10 memstore(/var/lib/ceph/osd/ceph-4) statfs
2021-06-03T11:55:42.193+0000 7f815da76700 10 memstore(/var/lib/ceph/osd/ceph-4) statfs: used_bytes: 18446744073708990649/200000000
```
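The `used_bytes` value is just below 2^64, i.e. the unsigned counter has wrapped past zero, and dividing it by the 200 MB total gives roughly the `cur ratio 9.22337e+10` that check_full_status reported above. A quick check of the arithmetic (hypothetical snippet):

```python
# Hypothetical check: the memstore statfs used_bytes has underflowed below zero.
used_bytes = 18446744073708990649   # from the statfs log line above
total_bytes = 200000000             # the 200 MB backing store

print("bytes below zero:", 2**64 - used_bytes)             # -> 560967 (~548 KiB negative)
print(f"full ratio: {used_bytes / total_bytes:.5e}")       # -> 9.22337e+10, consistent with check_full_status
```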
Updated by Xiubo Li almost 3 years ago
- Status changed from New to Fix Under Review
Updated by Kefu Chai almost 3 years ago
- Status changed from Fix Under Review to Resolved