Bug #55240

closed

mds: stuck 2 seconds and keeps retrying to find ino from auth MDS

Added by Venky Shankar about 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done: 100%
Source:
Tags:
Backport: quincy, pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): multimds, task(medium)
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen here: https://pulpito.ceph.com/vshankar-2022-04-07_05:07:33-fs-master-testing-default-smithi/6780578/

It's an upgrade test. Before the upgrade, cephadm disables standby-replay and reduces max_mds to 1, then waits for a single active MDS by checking the mdsmap. The test above has 2 active MDSs, each configured with a standby-replay MDS daemon. Disabling standby-replay goes fine: the standby-replay daemons transition to standby. However, after setting `max_mds = 1', one of the active MDSs is stuck in `up:stopping' and never transitions to `down:stopped'. The test fails after hitting "max job timeout".
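For anyone reproducing this, the "wait for a single active MDS" step can be sketched roughly as below. This is not cephadm's code. It assumes the mdsmap has already been fetched as a dict (for example from the JSON output of `ceph fs get <fs_name>`) and that it carries an "info" mapping whose entries have "rank" and "state" fields; that layout, and the helper names `ranked_states`, `is_single_active` and `wait_for_single_active`, are assumptions for illustration.

```python
import time
from typing import Callable, Dict, List


def ranked_states(mdsmap: Dict) -> List[str]:
    """Return the sorted states of daemons that hold a rank.

    Assumption: mdsmap["info"] maps an id to a dict with "rank" and "state".
    Standby-replay followers are skipped because they do not own the rank.
    """
    states = []
    for info in mdsmap.get("info", {}).values():
        if info.get("rank", -1) < 0:
            continue
        if info.get("state") == "up:standby-replay":
            continue
        states.append(info.get("state", ""))
    return sorted(states)


def is_single_active(mdsmap: Dict) -> bool:
    """True when exactly one rank is held and its daemon is up:active."""
    return ranked_states(mdsmap) == ["up:active"]


def wait_for_single_active(
    fetch_mdsmap: Callable[[], Dict],
    timeout: float = 300.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll the mdsmap until a single active MDS remains or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if is_single_active(fetch_mdsmap()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With a rank stuck in `up:stopping', as in this report, `ranked_states` keeps returning two entries, so the loop can only end by hitting its timeout.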

Problematic MDS log: ./remote/smithi176/log/aa4d8d7c-b63b-11ec-8c36-001a4aab830c/ceph-mds.cephfs.smithi176.ttvthc.log.gz

2022-04-07T06:41:43.092+0000 7fe6f18dc700 10 mds.cephfs.smithi176.ttvthc my gid is 24327
2022-04-07T06:41:43.092+0000 7fe6f18dc700 10 mds.cephfs.smithi176.ttvthc map says I am mds.1.8 state up:stopping
2022-04-07T06:41:43.092+0000 7fe6f18dc700 10 mds.cephfs.smithi176.ttvthc msgr says I am [v2:172.21.15.176:6824/2074966087,v1:172.21.15.176:6825/2074966087]
2022-04-07T06:41:43.092+0000 7fe6f18dc700 10 mds.cephfs.smithi176.ttvthc handle_mds_map: handling map as rank 1
2022-04-07T06:41:43.092+0000 7fe6f18dc700 10 notify_mdsmap: mds.metrics
2022-04-07T06:41:43.092+0000 7fe6f18dc700 10 notify_mdsmap: mds.metrics: rank0 is mds.cephfs.smithi116.vbikdi
2022-04-07T06:41:43.092+0000 7fe6ed0d3700  7 mds.1.8 mds has 1 queued contexts
2022-04-07T06:41:43.092+0000 7fe6ed0d3700 10 mds.1.8  finish 0x55c7560da940

The MDS seems to be waiting for some event to reach completion (maybe exporting a dir?).
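To see how long the rank sits in `up:stopping', the "map says I am" lines can be pulled out of the MDS log with a small script. The sketch below is only a triage aid, written against the line layout in the excerpt above (timestamp, thread id, debug level, daemon name, message). The names `MapObservation`, `parse_map_line` and `stopping_span` are made up for this example, and reading the second number in "mds.1.8" as the incarnation is an assumption.

```python
import re
from datetime import datetime
from typing import Iterable, NamedTuple, Optional

# Matches lines such as:
# 2022-04-07T06:41:43.092+0000 7fe6f18dc700 10 mds.<name> map says I am mds.1.8 state up:stopping
MAP_LINE = re.compile(
    r"^(?P<ts>\S+)\s+\S+\s+\d+\s+(?P<daemon>\S+) map says I am "
    r"mds\.(?P<rank>\d+)\.(?P<inc>\d+) state (?P<state>\S+)\s*$"
)


class MapObservation(NamedTuple):
    when: datetime
    daemon: str
    rank: int
    incarnation: int  # assumption: second number in "mds.<rank>.<n>"
    state: str


def parse_map_line(line: str) -> Optional[MapObservation]:
    """Parse one 'map says I am' log line; return None for any other line."""
    m = MAP_LINE.match(line)
    if m is None:
        return None
    when = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f%z")
    return MapObservation(
        when, m.group("daemon"), int(m.group("rank")), int(m.group("inc")), m.group("state")
    )


def stopping_span(lines: Iterable[str]) -> Optional[float]:
    """Seconds between the first and last up:stopping observation, or None."""
    seen = [o for o in map(parse_map_line, lines) if o and o.state == "up:stopping"]
    if not seen:
        return None
    return (seen[-1].when - seen[0].when).total_seconds()
```

A span that keeps growing until the job timeout would be consistent with the rank never reaching `down:stopped'.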


Related issues 3 (0 open, 3 closed)

Copied to Linux kernel client - Bug #55377: kclient: mds revoke Fwb caps stuck after the kclient tries writeback once (Resolved, Xiubo Li)
Copied to CephFS - Backport #55658: quincy: mds: stuck 2 seconds and keeps retrying to find ino from auth MDS (Resolved, Xiubo Li)
Copied to CephFS - Backport #55659: pacific: mds: stuck 2 seconds and keeps retrying to find ino from auth MDS (Resolved, Xiubo Li)
