Bug #62658
error during scrub thrashing: reached maximum tries (31) after waiting for 900 seconds
Description
/a/vshankar-2023-08-24_07:29:19-fs-wip-vshankar-testing-20230824.045828-testing-default-smithi/7378338
2023-08-24T09:12:27.873 INFO:tasks.cephfs.filesystem:scrub status for tag:12ec0e53-e310-41e0-82b7-80210c6c3553 - {'path': '/', 'tag': '12ec0e53-e310-41e0-82b7-80210c6c3553', 'options': 'recursive,force'}
2023-08-24T09:12:27.873 ERROR:tasks.fwd_scrub.fs.[cephfs]:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_eef54c669727cef06f428721567dad125542a9ae/qa/tasks/fwd_scrub.py", line 38, in _run
    self.do_scrub()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_eef54c669727cef06f428721567dad125542a9ae/qa/tasks/fwd_scrub.py", line 55, in do_scrub
    self._scrub()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_eef54c669727cef06f428721567dad125542a9ae/qa/tasks/fwd_scrub.py", line 76, in _scrub
    done = self.fs.wait_until_scrub_complete(tag=tag, sleep=30, timeout=self.scrub_timeout)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_eef54c669727cef06f428721567dad125542a9ae/qa/tasks/cephfs/filesystem.py", line 1727, in wait_until_scrub_complete
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_teuthology_449a1bc2027504e7b3c3d7b30fa4178906581da7/teuthology/contextutil.py", line 134, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (31) after waiting for 900 seconds
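The failure mode can be seen from the call arguments in the traceback: the qa task polls with sleep=30 and timeout=900, so 31 polls (~900 seconds of waiting) without scrub completion trip the limit. A minimal, hypothetical sketch of that retry loop — the names approximate teuthology's `contextutil` behaviour, not the real implementation:

```python
# Simplified, hypothetical model of the polling loop that raised this
# error. The real code lives in teuthology/contextutil.py and
# qa/tasks/cephfs/filesystem.py; this only illustrates the budget math.

class MaxWhileTries(Exception):
    pass

def wait_until_scrub_complete(is_done, sleep=30, timeout=900):
    """Poll is_done() every `sleep` (simulated) seconds; raise
    MaxWhileTries once `timeout` seconds of waiting are exhausted."""
    elapsed = 0
    tries = 0
    while True:
        tries += 1
        if is_done():
            return tries
        if elapsed >= timeout:
            raise MaxWhileTries(
                f"reached maximum tries ({tries}) "
                f"after waiting for {timeout} seconds")
        elapsed += sleep  # the real loop calls time.sleep(sleep) here

# A scrub that never completes (as in this bug) exhausts the budget:
try:
    wait_until_scrub_complete(lambda: False)
except MaxWhileTries as e:
    print(e)  # reached maximum tries (31) after waiting for 900 seconds
```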
Related issues
History
#1 Updated by Venky Shankar 3 months ago
- Assignee set to Milind Changire
Milind, PTAL. I vaguely recall a similar issue you were looking into a while back.
#2 Updated by Milind Changire 3 months ago
This is a job with scrubbing on dir frags on a set of replicas.
Interestingly there's no trace of handle_fragment_notify() being called in the replica mds.1 after the frags are created.
That seems to be the reason for the following logs:
2023-08-24T08:53:59.883+0000 7f1627828700 10 mds.1.scrubstack handle_scrub mds_scrub(queue_dir 0x100000004bb fragset_t(00*) 4ac785b5-b511-46be-b785-dcc97f8a33dd force recursive) v1 from mds.0
2023-08-24T08:53:59.883+0000 7f1627828700 10 mds.1.scrubstack handle_scrub no frag 00*
...
Whereas we have the following logs from mds.0:
2023-08-24T08:53:59.880+0000 7f595ae42700 20 mds.0.scrubstack scrub_dir_inode recursive mode, frags [111*,110*,101*,100*,011*,010*,001*,000*]
2023-08-24T08:53:59.880+0000 7f595ae42700 20 mds.0.scrubstack scrub_dir_inode forward fragset_t(00*,110*) to mds.1
#3 Updated by Milind Changire 3 months ago
Okay, so there are a few handle_fragment_notify logs, but not as many handle_fragment_notify_ack logs. That seems to be the problem.
After designating a different replica for CDir 0x100000004bb, the mds sends the dirfrags over to the replica for dirfrag management.
Later, scrubbing kicks in and tries to queue the scrub job on the remote mds ... and the remote (replica) then finds that it has no trace of the requested frag.
That is why the CDir remains in the ScrubStack forever ... eventually causing the qa test to bail out with a timeout.
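The sequence described above can be sketched as a toy model: the replica only knows about a dirfrag once it has processed the fragment notify, so a scrub request for a frag it was never told about goes nowhere and the originating mds's scrub stack never drains. All class and method names here are illustrative stand-ins, not the actual MDS (C++) code:

```python
# Hypothetical toy model of the failure sequence in this bug; names
# are illustrative only, not the real ScrubStack/MDCache code.

class ReplicaMDS:
    def __init__(self):
        self.frags = set()       # dirfrags this replica knows about

    def handle_fragment_notify(self, frag):
        # In the bug, this step never happens for frag 00* on mds.1.
        self.frags.add(frag)

    def handle_scrub(self, frag):
        # Mirrors the "mds.1.scrubstack handle_scrub no frag 00*" log:
        # an unknown frag means the scrub request is effectively dropped,
        # so mds.0 waits on an ack that never comes.
        if frag not in self.frags:
            return "no frag %s" % frag
        return "queued %s" % frag

mds1 = ReplicaMDS()
# mds.0 forwards fragset_t(00*) for CDir 0x100000004bb, but the
# fragment notify for 00* was never processed on the replica:
print(mds1.handle_scrub("00*"))  # no frag 00*
```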
#4 Updated by Venky Shankar 3 months ago
Milind Changire wrote:
okay, so there are a few handle_fragment_notify logs but not as many handle_fragment_notify_ack logs. that seems to be the problem.
after designating a different replica for CDir 0x100000004bb, the mds sends over the dirfrags to the replica for dirfrag management
That's really odd -- tests are run with debug_ms=1, so you should see the `MMDSFragmentNotify` message incoming on the replica mds.
#5 Updated by Venky Shankar 3 months ago
- Labels (FS) Manila added
3 instances from my run:
- https://pulpito.ceph.com/vshankar-2023-09-20_10:42:39-fs-wip-vshankar-testing-20230920.072635-testing-default-smithi/7399182/
- https://pulpito.ceph.com/vshankar-2023-09-20_10:42:39-fs-wip-vshankar-testing-20230920.072635-testing-default-smithi/7399329/
- https://pulpito.ceph.com/vshankar-2023-09-20_10:42:39-fs-wip-vshankar-testing-20230920.072635-testing-default-smithi/7399163/
#6 Updated by Venky Shankar 3 months ago
- Labels (FS) deleted (Manila)
#7 Updated by Milind Changire 2 months ago
- Pull request ID set to 53636
#8 Updated by Xiubo Li about 2 months ago
- Status changed from New to Fix Under Review
#11 Updated by Patrick Donnelly about 1 month ago
- Status changed from Fix Under Review to Pending Backport
- Source set to Q/A
- Backport changed from reef,quincy,pacific to reef,quincy
#12 Updated by Backport Bot about 1 month ago
- Copied to Backport #63416: reef: error during scrub thrashing: reached maximum tries (31) after waiting for 900 seconds added
#13 Updated by Backport Bot about 1 month ago
- Copied to Backport #63417: quincy: error during scrub thrashing: reached maximum tries (31) after waiting for 900 seconds added
#14 Updated by Backport Bot about 1 month ago
- Tags set to backport_processed