
Bug #48411

tasks.cephfs.test_volumes.TestSubvolumeGroups: RuntimeError: rank all failed to reach desired subtree state

Added by Jeff Layton 5 months ago. Updated 15 days ago.

Status:
Pending Backport
Priority:
High
Assignee:
Category:
Testing
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
MDS, qa-suite
Labels (FS):
qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I got this failure when doing some testing with the draft fscache rework. It looks unrelated to the kernel changes, and more like a bug in the MDS:

2020-12-01T14:17:03.365 INFO:tasks.cephfs_test_runner:test_subvolumegroup_pin_distributed (tasks.cephfs.test_volumes.TestSubvolumeGroups) ... ERROR
2020-12-01T14:17:03.366 INFO:tasks.cephfs_test_runner:                                              
2020-12-01T14:17:03.367 INFO:tasks.cephfs_test_runner:======================================================================
2020-12-01T14:17:03.368 INFO:tasks.cephfs_test_runner:ERROR: test_subvolumegroup_pin_distributed (tasks.cephfs.test_volumes.TestSubvolumeGroups)
2020-12-01T14:17:03.368 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-12-01T14:17:03.369 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):            
2020-12-01T14:17:03.369 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jtlayton_ceph_k-stock/qa/tasks/cephfs/cephfs_test_case.py", line 380, in _wait_distributed_subtrees
2020-12-01T14:17:03.369 INFO:tasks.cephfs_test_runner:    while proceed():                          
2020-12-01T14:17:03.370 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 133, in __call__
2020-12-01T14:17:03.370 INFO:tasks.cephfs_test_runner:    raise MaxWhileTries(error_msg)            
2020-12-01T14:17:03.370 INFO:tasks.cephfs_test_runner:teuthology.exceptions.MaxWhileTries: reached maximum tries (20) after waiting for 100 seconds
2020-12-01T14:17:03.371 INFO:tasks.cephfs_test_runner:                                              
2020-12-01T14:17:03.371 INFO:tasks.cephfs_test_runner:The above exception was the direct cause of the following exception:
2020-12-01T14:17:03.371 INFO:tasks.cephfs_test_runner:                                              
2020-12-01T14:17:03.371 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):            
2020-12-01T14:17:03.372 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jtlayton_ceph_k-stock/qa/tasks/cephfs/test_volumes.py", line 694, in test_subvolumegroup_pin_distributed
2020-12-01T14:17:03.372 INFO:tasks.cephfs_test_runner:    self._wait_distributed_subtrees(2 * 2, status=status, rank="all")
2020-12-01T14:17:03.373 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_jtlayton_ceph_k-stock/qa/tasks/cephfs/cephfs_test_case.py", line 389, in _wait_distributed_subtrees
2020-12-01T14:17:03.374 INFO:tasks.cephfs_test_runner:    raise RuntimeError("rank {0} failed to reach desired subtree state".format(rank)) from e
2020-12-01T14:17:03.374 INFO:tasks.cephfs_test_runner:RuntimeError: rank all failed to reach desired subtree state
2020-12-01T14:17:03.375 INFO:tasks.cephfs_test_runner:                                              
2020-12-01T14:17:03.375 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-12-01T14:17:03.375 INFO:tasks.cephfs_test_runner:Ran 17 tests in 581.397s                      
2020-12-01T14:17:03.376 INFO:tasks.cephfs_test_runner:                                              
2020-12-01T14:17:03.376 INFO:tasks.cephfs_test_runner:FAILED (errors=1)                             

See: https://pulpito.ceph.com/jlayton-2020-12-01_13:40:41-fs-master-wip-ceph-fscache-iter-basic-gibba/5671994/
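For context, the MaxWhileTries in the traceback comes from teuthology's polling helper: the test re-checks the subtree state up to 20 times (about 100 seconds) before giving up. A minimal, hypothetical sketch of that pattern (the name wait_until is illustrative, not the actual qa helper):

```python
import time

class MaxWhileTries(Exception):
    pass

def wait_until(predicate, tries=20, sleep=5):
    """Poll predicate() up to `tries` times, sleeping `sleep` seconds
    between attempts (20 tries x 5s matches the ~100 seconds in the
    traceback); raise MaxWhileTries if the condition never holds."""
    for _ in range(tries):
        if predicate():
            return True
        time.sleep(sleep)
    raise MaxWhileTries("reached maximum tries ({}) after waiting for "
                        "{} seconds".format(tries, tries * sleep))
```

The real _wait_distributed_subtrees passes a predicate that counts ephemerally distributed subtrees across the requested ranks; the RuntimeError above is raised from the MaxWhileTries once the retries are exhausted.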


Related issues

Copied to CephFS - Backport #50086: pacific: tasks.cephfs.test_volumes.TestSubvolumeGroups: RuntimeError: rank all failed to reach desired subtree state In Progress

History

#1 Updated by Patrick Donnelly 4 months ago

  • Status changed from New to Triaged
  • Assignee set to Patrick Donnelly
  • Priority changed from Normal to High
  • Target version set to v16.0.0
  • Source set to Q/A
  • Component(FS) MDS, qa-suite added
  • Labels (FS) qa-failure added

#2 Updated by Patrick Donnelly 3 months ago

  • Target version changed from v16.0.0 to v17.0.0
  • Backport set to pacific,octopus,nautilus

#3 Updated by Ramana Raja about 2 months ago

Patrick saw this again in pacific testing, https://pulpito.ceph.com/teuthology-2021-02-15_04:17:01-fs-pacific-distro-basic-smithi/5882403/

In Jeff's testing and in the run above, the number of ephemerally distributed subtrees reaches 3, but the expected value is at least 4 for test_subvolumegroup_pin_distributed() in test_volumes.py:
https://github.com/ceph/ceph/pull/36537/commits/e76abf517bf650262bc889f0361b970bf6c00881#diff-d3a7e3f3f24fff510b4d2a562b2093257b20b3908748c3432d14e460c449186bR665

A similar test, test_ephemeral_pin_distribution() in test_exports.py, passes in the same pacific testing run:
https://pulpito.ceph.com/teuthology-2021-02-15_04:17:01-fs-pacific-distro-basic-smithi/5882426/

In test_ephemeral_pin_distribution(), mds_export_ephemeral_distributed_factor = 21, max_mds = 3, and the expected number of ephemerally distributed subtrees is 64;
whereas in test_subvolumegroup_pin_distributed(), mds_export_ephemeral_distributed_factor = 2 (the default), max_mds = 2, and the expected number is 4.

Maybe raising mds_export_ephemeral_distributed_factor to 3, or decreasing the expected number of ephemerally distributed subtrees to 3 in test_subvolumegroup_pin_distributed(), would be sufficient?
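For what it's worth, both expected values quoted above are consistent with rounding factor * max_mds up to the next power of two (dirfrags split in powers of two). This is an inference from the two test cases, not a reading of the MDS code:

```python
import math

def expected_distributed_subtrees(factor, max_mds):
    # Assumption inferred from the two tests above: the pinned directory
    # is fragmented until it has at least factor * max_mds dirfrags, and
    # dirfrag counts come in powers of two.
    return 2 ** math.ceil(math.log2(factor * max_mds))

# factor=21, max_mds=3 -> 64 (test_ephemeral_pin_distribution)
# factor=2,  max_mds=2 -> 4  (test_subvolumegroup_pin_distributed)
```

Under that assumption, bumping the factor to 3 with max_mds = 2 would push the expected count to 8 rather than lowering it.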

Zheng's PR https://github.com/ceph/ceph/pull/36537/ limits the number of subtrees created by the ephemeral distributed pin.

#4 Updated by Patrick Donnelly about 2 months ago

Looks like the dirfrag was empty and closed:

2021-02-15T11:45:52.972+0000 7f7964236700 20 mds.0.cache trimming empty pinned subtree [dir 0x10000000001.01* /volumes/pinme/ [2,head] auth v=62 cv=62/62 dir_auth=0 state=1074266113|complete|auxsubtree f() n(v1) hs=0+0,ss=0+0 | frozen=0 subtree=1 dirty=0 authpin=0 0x55fbb4f4a000]
2021-02-15T11:45:52.972+0000 7f7964236700 10 mds.0.cache remove_subtree [dir 0x10000000001.01* /volumes/pinme/ [2,head] auth v=62 cv=62/62 dir_auth=0 state=1073741825|complete f() n(v1) hs=0+0,ss=0+0 | frozen=0 subtree=1 dirty=0 authpin=0 0x55fbb4f4a000]
2021-02-15T11:45:52.972+0000 7f7964236700 14 mds.0.cache.ino(0x10000000001) close_dirfrag 01*
2021-02-15T11:45:52.972+0000 7f7964236700 12 mds.0.cache.dir(0x10000000001.01*) remove_null_dentries [dir 0x10000000001.01* /volumes/pinme/ [2,head] auth v=62 cv=62/62 dir_auth=0 state=1073741825|complete f() n(v1) hs=0+0,ss=0+0 0x55fbb4f4a000]

From: /ceph/teuthology-archive/teuthology-2021-02-15_04:17:01-fs-pacific-distro-basic-smithi/5882403/remote/smithi096/log/ceph-mds.a.log.gz

Maybe try adding 50 subvolumes instead of 10 in order to ensure no fragment is empty.
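A hedged sketch of that suggestion, in the style of the qa helpers in test_volumes.py (the function name, the fs_cmd parameter, and the volume name "cephfs" are illustrative; the group name "pinme" comes from the MDS log above):

```python
def create_subvolumes(fs_cmd, volname, group, count=50):
    """Create `count` subvolumes in the pinned group so that no dirfrag
    of the group directory ends up empty. `fs_cmd` stands in for the
    qa suite's `ceph fs ...` command runner."""
    names = ["subvol_{}".format(i) for i in range(count)]
    for name in names:
        fs_cmd("subvolume", "create", volname, name, "--group_name", group)
    return names

# e.g.: create_subvolumes(self._fs_cmd, "cephfs", "pinme", count=50)
```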

#5 Updated by Ramana Raja 17 days ago

  • Assignee changed from Patrick Donnelly to Ramana Raja

#6 Updated by Ramana Raja 16 days ago

  • Category set to Testing
  • Status changed from Triaged to In Progress
  • Pull request ID set to 40509
  • ceph-qa-suite fs added

#7 Updated by Patrick Donnelly 15 days ago

  • Status changed from In Progress to Pending Backport
  • Backport changed from pacific,octopus,nautilus to pacific

#8 Updated by Backport Bot 15 days ago

  • Copied to Backport #50086: pacific: tasks.cephfs.test_volumes.TestSubvolumeGroups: RuntimeError: rank all failed to reach desired subtree state added
