Bug #51092 (closed)

mds: Timed out waiting for MDS daemons to become healthy

Added by Xiubo Li almost 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From https://pulpito.ceph.com/xiubli-2021-06-03_11:25:58-fs:functional:-wip-lxb-20210603-45434-pg-num-distro-basic-smithi/6149289/:

2021-06-03T11:59:43.755 INFO:tasks.cephfs_test_runner:test_full_fsync (tasks.cephfs.test_full.TestClusterFull) ... ERROR
2021-06-03T11:59:43.756 INFO:tasks.cephfs_test_runner:
2021-06-03T11:59:43.756 INFO:tasks.cephfs_test_runner:======================================================================
2021-06-03T11:59:43.757 INFO:tasks.cephfs_test_runner:ERROR: test_full_fsync (tasks.cephfs.test_full.TestClusterFull)
2021-06-03T11:59:43.757 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-06-03T11:59:43.757 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-06-03T11:59:43.758 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_986f0ca109bf8b3eec6027f40cf5b33b8b7bbaf4/qa/tasks/cephfs/test_full.py", line 395, in setUp
2021-06-03T11:59:43.758 INFO:tasks.cephfs_test_runner:    super(TestClusterFull, self).setUp()
2021-06-03T11:59:43.758 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_986f0ca109bf8b3eec6027f40cf5b33b8b7bbaf4/qa/tasks/cephfs/test_full.py", line 32, in setUp
2021-06-03T11:59:43.758 INFO:tasks.cephfs_test_runner:    CephFSTestCase.setUp(self)
2021-06-03T11:59:43.759 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_986f0ca109bf8b3eec6027f40cf5b33b8b7bbaf4/qa/tasks/cephfs/cephfs_test_case.py", line 169, in setUp
2021-06-03T11:59:43.759 INFO:tasks.cephfs_test_runner:    self.fs.wait_for_daemons()
2021-06-03T11:59:43.759 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_986f0ca109bf8b3eec6027f40cf5b33b8b7bbaf4/qa/tasks/cephfs/filesystem.py", line 1075, in wait_for_daemons
2021-06-03T11:59:43.760 INFO:tasks.cephfs_test_runner:    raise RuntimeError("Timed out waiting for MDS daemons to become healthy")
2021-06-03T11:59:43.760 INFO:tasks.cephfs_test_runner:RuntimeError: Timed out waiting for MDS daemons to become healthy
2021-06-03T11:59:43.760 INFO:tasks.cephfs_test_runner:
2021-06-03T11:59:43.760 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-06-03T11:59:43.761 INFO:tasks.cephfs_test_runner:Ran 4 tests in 416.483s
2021-06-03T11:59:43.761 INFO:tasks.cephfs_test_runner:
2021-06-03T11:59:43.761 INFO:tasks.cephfs_test_runner:FAILED (errors=1)
2021-06-03T11:59:43.762 INFO:tasks.cephfs_test_runner:
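For context, the RuntimeError comes from the QA helper that polls the MDS map until every rank is healthy. A minimal sketch of that polling pattern, assuming a simplified health check (the real logic lives in qa/tasks/cephfs/filesystem.py; the function name and JSON handling below are illustrative, not the exact teuthology code):

import json
import subprocess
import time

def wait_for_mds_daemons(fs_name, timeout=300, interval=5):
    # Illustrative stand-in for Filesystem.wait_for_daemons(): poll the
    # MDS map and return once every rank reports up:active.
    elapsed = 0
    while elapsed < timeout:
        out = subprocess.check_output(
            ["ceph", "fs", "get", fs_name, "--format=json"])
        info = json.loads(out)["mdsmap"]["info"]
        states = [rank["state"] for rank in info.values()]
        if states and all(s == "up:active" for s in states):
            return
        time.sleep(interval)
        elapsed += interval
    raise RuntimeError("Timed out waiting for MDS daemons to become healthy")

In this run the loop never saw mds.c leave up:creating, so it hit the timeout and raised the error above.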

Checking the `mds.c` daemon log, it appears that because the OSDs were full, `mds.c` stayed stuck in the up:creating state:

2021-06-03T11:55:58.894+0000 7f01a0e67700  7 mds.0.server operator(): full = 1 epoch = 92
2021-06-03T11:55:58.894+0000 7f01a0e67700  4 mds.0.39 handle_osd_map epoch 92, 0 new blocklist entries
2021-06-03T11:55:58.894+0000 7f01a0e67700 10 mds.0.server apply_blocklist: killed 0
...
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 mds.c my gid is 6078
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 mds.c map says I am mds.0.39 state up:creating
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 mds.c msgr says I am [v2:172.21.15.25:6834/3032317271,v1:172.21.15.25:6835/3032317271]
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 mds.c handle_mds_map: handling map as rank 0
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 notify_mdsmap: mds.metrics
2021-06-03T11:56:01.110+0000 7f01a0e67700 10 notify_mdsmap: mds.metrics: rank0 is unavailable
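The mds.0.server line above shows the MDS learning full = 1 from osdmap epoch 92. The same cluster-wide full flag and ratios can be read straight from the osdmap; a small helper, assuming the JSON field names of recent Ceph releases (the helper name itself is made up for illustration):

import json
import subprocess

def osdmap_full_status():
    # Read `ceph osd dump` and report the cluster full flag plus the
    # configured ratios that the MDS full handling reacts to.
    osdmap = json.loads(subprocess.check_output(
        ["ceph", "osd", "dump", "--format=json"]))
    flags = osdmap.get("flags", "").split(",")
    return {
        "full": "full" in flags,
        "full_ratio": osdmap.get("full_ratio"),
        "backfillfull_ratio": osdmap.get("backfillfull_ratio"),
        "nearfull_ratio": osdmap.get("nearfull_ratio"),
    }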

From the `osd.4` log, the daemon was in the failsafe full state:

2021-06-03T11:56:10.693+0000 7f814d502700 20 osd.4 93 check_full_status cur ratio 9.22337e+10, physical ratio 9.22337e+10, new state failsafe
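check_full_status compares the OSD's used/total ratio against an escalating set of thresholds, with failsafe the last rung. A rough Python mirror of that ladder, assuming the usual default ratios (mon_osd_nearfull_ratio 0.85, mon_osd_backfillfull_ratio 0.90, mon_osd_full_ratio 0.95, osd_failsafe_full_ratio 0.97); the real OSD code also handles statfs details this sketch ignores:

def classify_full_state(used_bytes, total_bytes,
                        nearfull=0.85, backfillfull=0.90,
                        full=0.95, failsafe=0.97):
    # Threshold ladder in the style of the OSD's check_full_status();
    # the default ratios above are common config defaults, not values
    # taken from this test run.
    ratio = used_bytes / total_bytes if total_bytes else 0.0
    if ratio >= failsafe:
        return "failsafe", ratio
    if ratio >= full:
        return "full", ratio
    if ratio >= backfillfull:
        return "backfillfull", ratio
    if ratio >= nearfull:
        return "nearfull", ratio
    return "none", ratio

Note that the logged "cur ratio 9.22337e+10" is far above any valid 0..1 ratio, which trivially lands osd.4 in the failsafe state.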