Bug #48411
closed
tasks.cephfs.test_volumes.TestSubvolumeGroups: RuntimeError: rank all failed to reach desired subtree state
Description
I got this failure while doing some testing with the draft fscache rework. It looks unrelated to the kernel changes, and more like a bug in the MDS:
2020-12-01T14:17:03.365 INFO:tasks.cephfs_test_runner:test_subvolumegroup_pin_distributed (tasks.cephfs.test_volumes.TestSubvolumeGroups) ... ERROR
2020-12-01T14:17:03.366 INFO:tasks.cephfs_test_runner:
2020-12-01T14:17:03.367 INFO:tasks.cephfs_test_runner:======================================================================
2020-12-01T14:17:03.368 INFO:tasks.cephfs_test_runner:ERROR: test_subvolumegroup_pin_distributed (tasks.cephfs.test_volumes.TestSubvolumeGroups)
2020-12-01T14:17:03.368 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-12-01T14:17:03.369 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2020-12-01T14:17:03.369 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jtlayton_ceph_k-stock/qa/tasks/cephfs/cephfs_test_case.py", line 380, in _wait_distributed_subtrees
2020-12-01T14:17:03.369 INFO:tasks.cephfs_test_runner:    while proceed():
2020-12-01T14:17:03.370 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 133, in __call__
2020-12-01T14:17:03.370 INFO:tasks.cephfs_test_runner:    raise MaxWhileTries(error_msg)
2020-12-01T14:17:03.370 INFO:tasks.cephfs_test_runner:teuthology.exceptions.MaxWhileTries: reached maximum tries (20) after waiting for 100 seconds
2020-12-01T14:17:03.371 INFO:tasks.cephfs_test_runner:
2020-12-01T14:17:03.371 INFO:tasks.cephfs_test_runner:The above exception was the direct cause of the following exception:
2020-12-01T14:17:03.371 INFO:tasks.cephfs_test_runner:
2020-12-01T14:17:03.371 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2020-12-01T14:17:03.372 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jtlayton_ceph_k-stock/qa/tasks/cephfs/test_volumes.py", line 694, in test_subvolumegroup_pin_distributed
2020-12-01T14:17:03.372 INFO:tasks.cephfs_test_runner:    self._wait_distributed_subtrees(2 * 2, status=status, rank="all")
2020-12-01T14:17:03.373 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jtlayton_ceph_k-stock/qa/tasks/cephfs/cephfs_test_case.py", line 389, in _wait_distributed_subtrees
2020-12-01T14:17:03.374 INFO:tasks.cephfs_test_runner:    raise RuntimeError("rank {0} failed to reach desired subtree state".format(rank)) from e
2020-12-01T14:17:03.374 INFO:tasks.cephfs_test_runner:RuntimeError: rank all failed to reach desired subtree state
2020-12-01T14:17:03.375 INFO:tasks.cephfs_test_runner:
2020-12-01T14:17:03.375 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-12-01T14:17:03.375 INFO:tasks.cephfs_test_runner:Ran 17 tests in 581.397s
2020-12-01T14:17:03.376 INFO:tasks.cephfs_test_runner:
2020-12-01T14:17:03.376 INFO:tasks.cephfs_test_runner:FAILED (errors=1)
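For context on the failure mode: the test polls the MDS subtree state and gives up after a fixed retry budget (the traceback shows 20 tries over 100 seconds from teuthology's contextutil, then wraps the timeout in a RuntimeError). A minimal sketch of that pattern, with illustrative names (this is not teuthology's actual API):

```python
import time

def wait_for(predicate, tries=20, sleep=5):
    """Poll `predicate` up to `tries` times, sleeping `sleep` seconds
    between failed attempts; mirrors the 20-try / 100-second budget
    seen in the traceback above."""
    for _ in range(tries):
        if predicate():
            return True
        time.sleep(sleep)
    raise RuntimeError(
        "reached maximum tries (%d) after waiting for %d seconds"
        % (tries, tries * sleep))
```

So "rank all failed to reach desired subtree state" just means the expected subtree count was never observed within the budget, not that the check itself crashed.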
Updated by Patrick Donnelly over 3 years ago
- Status changed from New to Triaged
- Assignee set to Patrick Donnelly
- Priority changed from Normal to High
- Target version set to v16.0.0
- Source set to Q/A
- Component(FS) MDS, qa-suite added
- Labels (FS) qa-failure added
Updated by Patrick Donnelly over 3 years ago
- Target version changed from v16.0.0 to v17.0.0
- Backport set to pacific,octopus,nautilus
Updated by Ramana Raja about 3 years ago
Patrick saw this again in pacific testing, https://pulpito.ceph.com/teuthology-2021-02-15_04:17:01-fs-pacific-distro-basic-smithi/5882403/
In Jeff's testing and in the run above, the number of ephemerally distributed subtrees reaches 3, but the expected value is at least 4 for test_subvolumegroup_pin_distributed() in test_volumes.py:
https://github.com/ceph/ceph/pull/36537/commits/e76abf517bf650262bc889f0361b970bf6c00881#diff-d3a7e3f3f24fff510b4d2a562b2093257b20b3908748c3432d14e460c449186bR665
A similar test, test_ephemeral_pin_distribution() in test_exports.py, passes in the same pacific testing run:
https://pulpito.ceph.com/teuthology-2021-02-15_04:17:01-fs-pacific-distro-basic-smithi/5882426/
In test_ephemeral_pin_distribution(), mds_export_ephemeral_distributed_factor = 21, max_mds = 3, and the expected number of ephemerally distributed subtrees is 64,
whereas in test_subvolumegroup_pin_distributed(), mds_export_ephemeral_distributed_factor = 2 (the default), max_mds = 2, and the expected number of ephemerally distributed subtrees is 4.
Maybe raising mds_export_ephemeral_distributed_factor to 3, or decreasing the expected number of ephemerally distributed subtrees in test_subvolumegroup_pin_distributed() to 3, would be sufficient?
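One way to reconcile the two expected counts (4 and 64) is to read them as the smallest power-of-two number of dirfrags covering factor * max_mds, since dirfrags split in powers of two. This is only a back-of-envelope hypothesis about the numbers quoted above, not the actual MDS fragmentation logic:

```python
import math

def expected_distributed_subtrees(factor, max_mds):
    # Hypothesis: the directory is fragmented into the smallest
    # power-of-two number of dirfrags >= factor * max_mds.
    return 2 ** math.ceil(math.log2(factor * max_mds))
```

Under this reading, factor 2 with max_mds 2 gives 4, and factor 21 with max_mds 3 (63) rounds up to 64, matching both tests.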
See also Zheng's PR https://github.com/ceph/ceph/pull/36537/, which limits the number of subtrees created by the ephemeral distributed pin.
Updated by Patrick Donnelly about 3 years ago
Looks like the dirfrag was empty and closed:
2021-02-15T11:45:52.972+0000 7f7964236700 20 mds.0.cache trimming empty pinned subtree [dir 0x10000000001.01* /volumes/pinme/ [2,head] auth v=62 cv=62/62 dir_auth=0 state=1074266113|complete|auxsubtree f() n(v1) hs=0+0,ss=0+0 | frozen=0 subtree=1 dirty=0 authpin=0 0x55fbb4f4a000]
2021-02-15T11:45:52.972+0000 7f7964236700 10 mds.0.cache remove_subtree [dir 0x10000000001.01* /volumes/pinme/ [2,head] auth v=62 cv=62/62 dir_auth=0 state=1073741825|complete f() n(v1) hs=0+0,ss=0+0 | frozen=0 subtree=1 dirty=0 authpin=0 0x55fbb4f4a000]
2021-02-15T11:45:52.972+0000 7f7964236700 14 mds.0.cache.ino(0x10000000001) close_dirfrag 01*
2021-02-15T11:45:52.972+0000 7f7964236700 12 mds.0.cache.dir(0x10000000001.01*) remove_null_dentries [dir 0x10000000001.01* /volumes/pinme/ [2,head] auth v=62 cv=62/62 dir_auth=0 state=1073741825|complete f() n(v1) hs=0+0,ss=0+0 0x55fbb4f4a000]
From: /ceph/teuthology-archive/teuthology-2021-02-15_04:17:01-fs-pacific-distro-basic-smithi/5882403/remote/smithi096/log/ceph-mds.a.log.gz
Maybe try adding 50 subvolumes instead of 10 in order to ensure no fragment is empty.
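A rough sanity check for that suggestion: if subvolume entries hash roughly uniformly across the dirfrags, the chance that some fragment stays empty drops sharply with more subvolumes. This is only an illustrative estimate (a union bound over fragments, assuming uniform hashing), not a statement about the MDS hash function:

```python
def p_some_fragment_empty(fragments, subvolumes):
    # Union-bound estimate: P(at least one of `fragments` dirfrags
    # receives none of `subvolumes` uniformly hashed entries).
    p_one_empty = (1 - 1.0 / fragments) ** subvolumes
    return min(1.0, fragments * p_one_empty)
```

With 4 fragments, 10 subvolumes leave a noticeable chance (roughly 20%) of an empty fragment, while 50 subvolumes make it negligible, which is consistent with the empty-dirfrag trim seen in the log above and with bumping the test to 50 subvolumes.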
Updated by Ramana Raja about 3 years ago
- Assignee changed from Patrick Donnelly to Ramana Raja
Updated by Ramana Raja about 3 years ago
- Category set to Testing
- Status changed from Triaged to In Progress
- Pull request ID set to 40509
- ceph-qa-suite fs added
Updated by Patrick Donnelly about 3 years ago
- Status changed from In Progress to Pending Backport
- Backport changed from pacific,octopus,nautilus to pacific
Updated by Backport Bot about 3 years ago
- Copied to Backport #50086: pacific: tasks.cephfs.test_volumes.TestSubvolumeGroups: RuntimeError: rank all failed to reach desired subtree state added
Updated by Loïc Dachary almost 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".