Bug #62658

error during scrub thrashing: reached maximum tries (31) after waiting for 900 seconds

Added by Venky Shankar 8 months ago. Updated about 23 hours ago.

Status:
Pending Backport
Priority:
Normal
Category:
Correctness/Safety
Target version:
% Done:
0%
Source:
Q/A
Tags:
backport_processed
Backport:
reef,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
scrub
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/vshankar-2023-08-24_07:29:19-fs-wip-vshankar-testing-20230824.045828-testing-default-smithi/7378338

2023-08-24T09:12:27.873 INFO:tasks.cephfs.filesystem:scrub status for tag:12ec0e53-e310-41e0-82b7-80210c6c3553 - {'path': '/', 'tag': '12ec0e53-e310-41e0-82b7-80210c6c3553', 'options': 'recursive,force'}
2023-08-24T09:12:27.873 ERROR:tasks.fwd_scrub.fs.[cephfs]:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_eef54c669727cef06f428721567dad125542a9ae/qa/tasks/fwd_scrub.py", line 38, in _run
    self.do_scrub()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_eef54c669727cef06f428721567dad125542a9ae/qa/tasks/fwd_scrub.py", line 55, in do_scrub
    self._scrub()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_eef54c669727cef06f428721567dad125542a9ae/qa/tasks/fwd_scrub.py", line 76, in _scrub
    done = self.fs.wait_until_scrub_complete(tag=tag, sleep=30, timeout=self.scrub_timeout)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_eef54c669727cef06f428721567dad125542a9ae/qa/tasks/cephfs/filesystem.py", line 1727, in wait_until_scrub_complete
    while proceed():
  File "/home/teuthworker/src/git.ceph.com_teuthology_449a1bc2027504e7b3c3d7b30fa4178906581da7/teuthology/contextutil.py", line 134, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (31) after waiting for 900 seconds
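
For context: the fwd_scrub task starts a recursive scrub with a tag and then polls the scrub status until that tag disappears, and the MaxWhileTries above is that polling loop (teuthology's safe_while helper, visible in the traceback) giving up after its 900-second budget. A rough, self-contained sketch of the same pattern outside teuthology -- the file system name "cephfs" and the substring check against the raw JSON output are simplifying assumptions, not the exact qa code:

import json
import subprocess
import time

def wait_for_scrub_tag_to_clear(tag, fs_name="cephfs", sleep=30, timeout=900):
    # Poll `ceph tell mds.<fs>:0 scrub status` until `tag` is no longer
    # reported. This approximates wait_until_scrub_complete() in
    # qa/tasks/cephfs/filesystem.py, minus the safe_while machinery.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.check_output(
            ["ceph", "tell", f"mds.{fs_name}:0", "scrub", "status"])
        # Checking the serialized output avoids depending on the exact
        # JSON schema of the scrub status report.
        if tag not in json.dumps(json.loads(out)):
            return
        time.sleep(sleep)
    raise RuntimeError(f"scrub with tag {tag} did not complete in {timeout}s")

In this run the tag never drops out of the status output (the forwarded dirfrag is never scrubbed -- see the comments below), so the loop exhausts its budget and the task fails with the traceback above.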

Related issues (3 open, 0 closed)

Related to CephFS - Bug #48680: mds: scrubbing stuck "scrub active (0 inodes in the stack)" (New, Milind Changire)

Copied to CephFS - Backport #63416: reef: error during scrub thrashing: reached maximum tries (31) after waiting for 900 seconds (In Progress, Milind Changire)
Copied to CephFS - Backport #63417: quincy: error during scrub thrashing: reached maximum tries (31) after waiting for 900 seconds (In Progress, Milind Changire)
Actions #1

Updated by Venky Shankar 8 months ago

  • Assignee set to Milind Changire

Milind, PTAL. I vaguely recall a similar issue you were looking into a while back.

Actions #2

Updated by Milind Changire 8 months ago

This is a job with scrubbing on dir frags replicated across a set of MDS replicas.
Interestingly, there's no trace of handle_fragment_notify() being called on the replica mds.1 after the frags are created.
That seems to be the reason for the following logs:

2023-08-24T08:53:59.883+0000 7f1627828700 10 mds.1.scrubstack handle_scrub mds_scrub(queue_dir 0x100000004bb fragset_t(00*) 4ac785b5-b511-46be-b785-dcc97f8a33dd force recursive) v1 from mds.0
2023-08-24T08:53:59.883+0000 7f1627828700 10 mds.1.scrubstack handle_scrub no frag 00*
...
...

Whereas we have the following logs from mds.0:

2023-08-24T08:53:59.880+0000 7f595ae42700 20 mds.0.scrubstack scrub_dir_inode recursive mode, frags [111*,110*,101*,100*,011*,010*,001*,000*]
2023-08-24T08:53:59.880+0000 7f595ae42700 20 mds.0.scrubstack scrub_dir_inode forward fragset_t(00*,110*) to mds.1

Actions #3

Updated by Milind Changire 8 months ago

Okay, so there are a few handle_fragment_notify logs but not as many handle_fragment_notify_ack logs.
That seems to be the problem.

After designating a different replica for CDir 0x100000004bb, the MDS sends over the dirfrags to the replica for dirfrag management.

Later, scrubbing kicks in and tries to queue the scrub job on the remote MDS ... and the remote (replica) then finds that it doesn't have any trace of the requested frag.

That's the reason the CDir remains in the ScrubStack forever ... eventually causing the QA test to bail out due to a timeout.
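
To make that sequence concrete, here is a toy model -- plain Python, not actual MDS code; the class and method names only mirror the log messages above -- of why a frag the replica never learned about keeps the scrub stuck:

class ReplicaMDS:                        # stands in for mds.1
    def __init__(self):
        self.frags = set()               # dirfrags learned via fragment notifies

    def handle_fragment_notify(self, frag):
        # In the failing run this never happens for 0x100000004bb's new frags.
        self.frags.add(frag)

    def handle_scrub(self, frag):
        if frag not in self.frags:
            # Matches "handle_scrub no frag 00*" on mds.1: the request is
            # dropped, so no completion ever reaches the auth MDS.
            return False
        return True

class AuthMDS:                           # stands in for mds.0
    def __init__(self, replica):
        self.replica = replica
        self.scrub_stack = []

    def scrub_dir_inode(self, frags_for_replica):
        for frag in frags_for_replica:
            self.scrub_stack.append(frag)
            if self.replica.handle_scrub(frag):   # forward to the replica
                self.scrub_stack.remove(frag)
        # Whatever is left here is what "scrub status" keeps reporting as active.

replica = ReplicaMDS()                   # never receives the fragment notify
auth = AuthMDS(replica)
auth.scrub_dir_inode(["00*", "110*"])
print(auth.scrub_stack)                  # ['00*', '110*'] -- stuck, so the qa wait times out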

Actions #4

Updated by Venky Shankar 8 months ago

Milind Changire wrote:

Okay, so there are a few handle_fragment_notify logs but not as many handle_fragment_notify_ack logs.
That seems to be the problem.

After designating a different replica for CDir 0x100000004bb, the MDS sends over the dirfrags to the replica for dirfrag management.

That's really odd -- tests are run with debug_ms=1, so you should see the `MMDSFragmentNotify` message incoming on the replica MDS.
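
A crude way to check that on the archived logs -- the log path is illustrative and the "fragment_notify" substring is an assumption about how the message is rendered in the wire log:

import re
import sys

# With debug_ms=1 every received message is logged, so the replica's MDS log
# should contain the incoming MMDSFragmentNotify line if it ever arrived.
pattern = re.compile(r"fragment_notify", re.IGNORECASE)
with open(sys.argv[1]) as log:   # e.g. .../remote/<host>/log/ceph-mds.<name>.log
    for line in log:
        if pattern.search(line):
            sys.stdout.write(line)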

Actions #6

Updated by Venky Shankar 8 months ago

  • Labels (FS) deleted (Manila)
Actions #7

Updated by Milind Changire 7 months ago

  • Pull request ID set to 53636
Actions #11

Updated by Patrick Donnelly 6 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Source set to Q/A
  • Backport changed from reef,quincy,pacific to reef,quincy
Actions #12

Updated by Backport Bot 6 months ago

  • Copied to Backport #63416: reef: error during scrub thrashing: reached maximum tries (31) after waiting for 900 seconds added
Actions #13

Updated by Backport Bot 6 months ago

  • Copied to Backport #63417: quincy: error during scrub thrashing: reached maximum tries (31) after waiting for 900 seconds added
Actions #14

Updated by Backport Bot 6 months ago

  • Tags set to backport_processed
Actions #16

Updated by Venky Shankar 24 days ago

  • Related to Bug #48680: mds: scrubbing stuck "scrub active (0 inodes in the stack)" added