Bug #51092 (closed)

mds: Timed out waiting for MDS daemons to become healthy

Added by Xiubo Li almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From https://pulpito.ceph.com/xiubli-2021-06-03_11:25:58-fs:functional:-wip-lxb-20210603-45434-pg-num-distro-basic-smithi/6149289/:

2021-06-03T11:59:43.755 INFO:tasks.cephfs_test_runner:test_full_fsync (tasks.cephfs.test_full.TestClusterFull) ... ERROR
2021-06-03T11:59:43.756 INFO:tasks.cephfs_test_runner:
2021-06-03T11:59:43.756 INFO:tasks.cephfs_test_runner:======================================================================
2021-06-03T11:59:43.757 INFO:tasks.cephfs_test_runner:ERROR: test_full_fsync (tasks.cephfs.test_full.TestClusterFull)
2021-06-03T11:59:43.757 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-06-03T11:59:43.757 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-06-03T11:59:43.758 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_986f0ca109bf8b3eec6027f40cf5b33b8b7bbaf4/qa/tasks/cephfs/test_full.py", line 395, in setUp
2021-06-03T11:59:43.758 INFO:tasks.cephfs_test_runner:    super(TestClusterFull, self).setUp()
2021-06-03T11:59:43.758 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_986f0ca109bf8b3eec6027f40cf5b33b8b7bbaf4/qa/tasks/cephfs/test_full.py", line 32, in setUp
2021-06-03T11:59:43.758 INFO:tasks.cephfs_test_runner:    CephFSTestCase.setUp(self)
2021-06-03T11:59:43.759 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_986f0ca109bf8b3eec6027f40cf5b33b8b7bbaf4/qa/tasks/cephfs/cephfs_test_case.py", line 169, in setUp
2021-06-03T11:59:43.759 INFO:tasks.cephfs_test_runner:    self.fs.wait_for_daemons()
2021-06-03T11:59:43.759 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_986f0ca109bf8b3eec6027f40cf5b33b8b7bbaf4/qa/tasks/cephfs/filesystem.py", line 1075, in wait_for_daemons
2021-06-03T11:59:43.760 INFO:tasks.cephfs_test_runner:    raise RuntimeError("Timed out waiting for MDS daemons to become healthy")
2021-06-03T11:59:43.760 INFO:tasks.cephfs_test_runner:RuntimeError: Timed out waiting for MDS daemons to become healthy
2021-06-03T11:59:43.760 INFO:tasks.cephfs_test_runner:
2021-06-03T11:59:43.760 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-06-03T11:59:43.761 INFO:tasks.cephfs_test_runner:Ran 4 tests in 416.483s
2021-06-03T11:59:43.761 INFO:tasks.cephfs_test_runner:
2021-06-03T11:59:43.761 INFO:tasks.cephfs_test_runner:FAILED (errors=1)
2021-06-03T11:59:43.762 INFO:tasks.cephfs_test_runner:

Checking the `mds.c` daemon log, it seems that because the OSDs were full, mds.c stayed stuck in the up:creating state:

2021-06-03T11:55:58.894+0000 7f01a0e67700  7 mds.0.server operator(): full = 1 epoch = 92
2021-06-03T11:55:58.894+0000 7f01a0e67700  4 mds.0.39 handle_osd_map epoch 92, 0 new blocklist entries
2021-06-03T11:55:58.894+0000 7f01a0e67700 10 mds.0.server apply_blocklist: killed 0
...
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 mds.c my gid is 6078
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 mds.c map says I am mds.0.39 state up:creating
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 mds.c msgr says I am [v2:172.21.15.25:6834/3032317271,v1:172.21.15.25:6835/3032317271]
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 mds.c handle_mds_map: handling map as rank 0
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 notify_mdsmap: mds.metrics
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 notify_mdsmap: mds.metrics: rank0 is unavailable

From the osd.4 log, the osd.4 daemon was in the failsafe full state (the most severe full condition, in which the OSD rejects writes), which explains why mds.0 could not get past up:creating:

2021-06-03T11:56:10.693+0000 7f814d502700 20 osd.4 93 check_full_status cur ratio 9.22337e+10, physical ratio 9.22337e+10, new state failsafe
#1

Updated by Xiubo Li almost 3 years ago

  • Priority changed from Normal to High
#2

Updated by Xiubo Li almost 3 years ago

  • Assignee set to Xiubo Li
#3

Updated by Patrick Donnelly almost 3 years ago

Xiubo, suggest you retest with a change to have wait_for_daemons also periodically run "ceph osd dump" and "ceph osd df" for debugging purposes. It looks like this may be a RADOS bug.

#4

Updated by Xiubo Li almost 3 years ago

Patrick Donnelly wrote:

Xiubo, suggest you retest with a change to have wait_for_daemons also periodically run "ceph osd dump" and "ceph osd df" for debugging purposes. It looks like this may be a RADOS bug.

Sure, will do that.
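
Something along these lines should work for the retest (a rough sketch only; are_daemons_healthy() and run_ceph() below are placeholders for whatever the qa Filesystem class actually exposes, not the real method names):

import time

def wait_for_daemons_with_osd_debug(fs, timeout=300, interval=5):
    # Same idea as wait_for_daemons(), but dump the OSD state on every poll
    # so a stuck full/failsafe OSD shows up in the teuthology log.
    elapsed = 0
    while not fs.are_daemons_healthy():        # placeholder health check
        fs.run_ceph('osd', 'dump')             # i.e. run "ceph osd dump"
        fs.run_ceph('osd', 'df')               # i.e. run "ceph osd df"
        if elapsed > timeout:
            raise RuntimeError("Timed out waiting for MDS daemons to become healthy")
        time.sleep(interval)
        elapsed += interval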

#5

Updated by Xiubo Li almost 3 years ago

From the osd.4 logs we can see that the available size is 0xc037e7b = 192.2MB, which is larger than the total of 0xbebc200 = 190.7MB; the reported available space should never exceed the total:

2021-06-03T11:56:10.693+0000 7f814d502700  5 osd.4 93 heartbeat osd_stat(store_statfs(0xc037e7b/0x0/0xbebc200, data 0x0/0x0, compress 0x0/0x0/0x0, omap 0x0, meta 0x0), peers [0,1,2,3,5,6,7] op hist [])
2021-06-03T11:56:10.693+0000 7f814d502700 20 osd.4 93 check_full_status cur ratio 9.22337e+10, physical ratio 9.22337e+10, new state failsafe
2021-06-03T11:56:10.693+0000 7f814d502700  1 -- 172.21.15.171:0/15202 --> [v2:172.21.15.25:6806/14862,v1:172.21.15.25:6807/14862] -- osd_ping(ping e93 up_from 13 ping_stamp 2021-06-03T11:56:10.695874+0000/254.858230591s send_stamp 254.858230591s delta_ub -0.032985143s) v5 -- 0x5616279f6180 con 0x561626fa1000

There must be something wrong with how the available size is calculated when disk space is released.
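
For what it's worth, the impossible check_full_status ratio above is consistent with exactly that: if the used space is derived as total minus available, and the subtraction is done in unsigned 64-bit arithmetic (an assumption about the code path, for illustration only), a small negative value wraps around to nearly 2^64:

# Sketch: reproduce the "cur ratio 9.22337e+10" value from the osd.4 log,
# assuming ratio ~= used / total with used = total - available (assumption).
total     = 0xbebc200            # 200,000,000 bytes (the ~190.7MB total above)
available = 0xc037e7b            # 201,555,579 bytes (the ~192.2MB noted above)
used = (total - available) & 0xFFFFFFFFFFFFFFFF   # negative result wraps near 2**64
print(used / total)              # ~9.22337e+10, matching the log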

Some earlier abnormal logs:


2021-06-03T11:55:42.193+0000 7f815da76700 10 osd.4 79 tick_without_osd_lock
2021-06-03T11:55:42.193+0000 7f815da76700 10 memstore(/var/lib/ceph/osd/ceph-4) statfs
2021-06-03T11:55:42.193+0000 7f815da76700 10 memstore(/var/lib/ceph/osd/ceph-4) statfs: used_bytes: 18446744073708990649/200000000
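
That used_bytes value is 2^64 minus 560,967, i.e. the used-bytes counter has been decremented below zero and wrapped around, which again points at the accounting done when space is released rather than at real usage:

# Sketch: the abnormal used_bytes is an unsigned 64-bit counter that wrapped below zero.
used_bytes = 18446744073708990649
print(2**64 - used_bytes)        # 560967, i.e. used_bytes is effectively -560,967
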
#6

Updated by Xiubo Li almost 3 years ago

  • Pull request ID set to 41914
#7

Updated by Xiubo Li almost 3 years ago

  • Status changed from New to Fix Under Review
#8

Updated by Kefu Chai almost 3 years ago

  • Status changed from Fix Under Review to Resolved
