Project

General

Profile

Actions

Bug #65603

open

mds: quiesce timeout due to a freezing directory

Added by Leonid Usov 26 days ago. Updated 16 days ago.

Status:
Pending Backport
Priority:
High
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Q/A
Tags:
backport_processed
Backport:
squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
MDS, quiesce
Labels (FS):
qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Analyzing one of the ETIMEDOUT error for a quiesce, looking at
https://pulpito.ceph.com/leonidus-2024-04-21_11:37:13-fs-wip-lusov-quiescer-distro-default-smithi/7666603/

The problem was with the following quiesce request:

2024-04-21T12:04:11.496 INFO:teuthology.orchestra.run.smithi148.stderr:2024-04-21T12:04:11.493+0000 7f49bd39f740  4 client.5370 mds_command: new command op to 4282 tid=0 multi_id=0 [{"prefix": "quiesce db", "roots": ["/"], "timeout": 60.0, "expiration": 90.0, "await": true}]
2024-04-21T12:04:11.496 INFO:teuthology.orchestra.run.smithi148.stderr:2024-04-21T12:04:11.493+0000 7f49bd39f740  1 -- 172.21.15.148:0/714775596 --> [v2:172.21.15.157:6833/2409369305,v1:172.21.15.157:6835/2409369305] -- command(tid 0: {"prefix": "quiesce db", "roots": ["/"], "timeout": 60.0, "expiration": 90.0, "await": true}) -- 0x556bb10c3980 con 0x556bb116fbc0

2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stderr:2024-04-21T12:05:11.495+0000 7f49bd39f740  1 -- 172.21.15.148:0/714775596 shutdown_connections
2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stderr:2024-04-21T12:05:11.495+0000 7f49bd39f740  1 -- 172.21.15.148:0/714775596 wait complete.
2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stderr:2024-04-21T12:05:11.495+0000 7f49bd39f740 20 client.5370 trim_cache size 0 max 16384
2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stdout:{
2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stdout:    "epoch": 9,
2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stdout:    "leader": 4282,
2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stdout:    "set_version": 22,
2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stdout:    "sets": {
2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stdout:        "75b48996": {
2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout:            "version": 22,
2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout:            "age_ref": 0.0,
2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout:            "state": {
2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout:                "name": "TIMEDOUT",
2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout:                "age": 0.0
2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout:            },
2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout:            "timeout": 60.0,
2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout:            "expiration": 90.0,
2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout:            "members": {
2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout:                "file:/": {
2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout:                    "excluded": false,
2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout:                    "state": {
2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout:                        "name": "QUIESCING",
2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout:                        "age": 60.0
2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout:                    }
2024-04-21T12:05:11.498 INFO:teuthology.orchestra.run.smithi148.stdout:                }
2024-04-21T12:05:11.498 INFO:teuthology.orchestra.run.smithi148.stdout:            }
2024-04-21T12:05:11.498 INFO:teuthology.orchestra.run.smithi148.stdout:        }
2024-04-21T12:05:11.498 INFO:teuthology.orchestra.run.smithi148.stdout:    }
2024-04-21T12:05:11.498 INFO:teuthology.orchestra.run.smithi148.stdout:}
2024-04-21T12:05:11.498 DEBUG:teuthology.orchestra.run:got remote process result: 110
2024-04-21T12:05:11.499 INFO:teuthology.orchestra.run.smithi148.stderr:Error ETIMEDOUT:
2024-04-21T12:05:11.499 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't quiesce root with rc: -110 (ETIMEDOUT), stdout:

Extracting the information from the logs and deducing incomplete requests yields:

[          0x1] since 2024-04-21 12:04:11.493000: /
[0x10000000000] since 2024-04-21 12:04:11.494000: /client.0/
[0x100000017b0] since 2024-04-21 12:04:11.494000: /client.0/tmp/
[0x100000017b1] since 2024-04-21 12:04:11.495000: /client.0/tmp/ceph/
[0x10000002fe1] since 2024-04-21 12:04:11.496000: /client.0/tmp/ceph/src/
[0x10000003733] since 2024-04-21 12:04:11.504000: /client.0/tmp/ceph/src/libcephfs.cc
[0x10000003167] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_common.sh
[0x100000035ca] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/dmclock/
[0x10000003043] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/btrfs_ioc_test.c
[0x1000000305a] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph
[0x10000002fe7] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/TODO
[0x10000003049] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph
[0x10000003419] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/crimson/
[0x100000036ca] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/isa
[0x10000003041] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/blkin/
[0x10000003a99] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mgr/
[0x10000003b8f] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mypy.ini
[0x10000003b9f] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/os/
[0x10000003b8b] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mstart.sh
[0x10000003615] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/exporter/
[0x100000036cb] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/jaegertracing/
[0x100000036c9] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/init
[0x10000003171] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/cephadm/
[0x10000003047] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph
[0x100000035e3] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/erasure
[0x10000003726] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/kv/
[0x100000036c8] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/init
[0x100000035cb] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/doc/
[0x100000033f6] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/compressor/
[0x10000003b8c] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mstop.sh
[0x10000003614] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/etc
[0x10000003635] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/googletest/
[0x10000003734] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/libcephsqlite.cc
[0x10000003956] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/loadclass.sh
[0x10000003b91] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/neorados/
[0x10000003165] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph.conf.twoosds
[0x10000003bde] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/osd/
[0x100000036f2] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/journal/
[0x10000003045] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph
[0x10000003b1c] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mount.fuse.ceph
[0x10000003029] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/bash_completion/
[0x10000003b90] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/nasm
[0x10000003597] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/crush/
[0x10000002ff3] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/auth/
[0x10000003725] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/krbd.cc
[0x10000003b9b] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ocf/
[0x10000003636] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/include/
[0x10000003168] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_fuse.cc
[0x10000003292] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/cls_acl.cc
[0x10000003042] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/boost_redis/
[0x10000003b1d] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mount/
[0x1000000304a] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph
[0x10000002ff2] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/arrow/
[0x10000003754] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/librbd/
[0x10000003736] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/librados
[0x10000003b99] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/objsync/
[0x1000000316c] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_osd.cc
[0x10000003294] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/cmonctl
[0x1000000374d] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/libradosstriper/
[0x10000003044] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/c
[0x1000000395f] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mds/
[0x1000000361e] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/extblkdev/
[0x1000000320a] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/cls/
[0x10000003735] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/libkmip/
[0x1000000316e] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_syn.cc
[0x1000000316b] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_mon.cc
[0x10000002fe8] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/arch/
[0x10000002fe3] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/CMakeLists.txt
[0x1000000370f] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/json_spirit/
[0x1000000305b] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph
[0x10000003b24] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mrgw.sh
[0x10000003b8e] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mypy
[0x1000000371e] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/key_value_store/
[0x10000002fe5] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/SimpleRADOSStriper.cc
[0x1000000305d] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph
[0x10000003b96] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/objclass/
[0x100000031e3] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ckill.sh
[0x1000000316f] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_ver.c
[0x10000003ade] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mon/
[0x10000003048] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph
[0x1000000316a] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_mgr.cc
[0x10000002fe6] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/SimpleRADOSStriper.h
[0x100000035c9] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/cstart.sh
[0x100000031e4] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/client/
[0x10000002fe4] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/README
[0x10000003b26] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/msg/
[0x10000003737] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/librados/
[0x10000003b8d] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/multi
[0x10000003169] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_mds.cc
[0x1000000302f] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/blk/
[0x10000003295] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/common/
[0x10000003629] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/fmt/
[0x10000002fe2] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/.gitignore
[0x1000000305c] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph
[0x10000003957] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/log/
[0x10000003166] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph.in
[0x10000003293] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/cls_crypto.cc
[0x1000000395e] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/logrotate.conf
[0x100000039ec] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/messages/
[0x1000000362a] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/global/
[0x100000035b0] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/crypto/
[0x100000036ce] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/java/
[0x100000035da] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/dokan/
[0x10000003b25] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mrun
[0x10000003046] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph
[0x10000003059] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph
[0x1000000316d] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_release
[0x10000003170] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_ver.h.in.cmake
total pending: 108
complete (min/max/avg/stdev): 6175 (0.0/4.282/0.71/0.46)

We can see that all of the pending quiesces are under @[0x10000002fe1] /client.0/tmp/ceph/src@. All inodes under the directory complain about the inability to authpin:

2024-04-21T12:04:11.506+0000 7fbb7d1d4640 10 mds.0.locker  can't auth_pin (freezing?), waiting to authpin [inode 0x10000003733 [1fa,head] /client.0/tmp/ceph/src/libcephfs.cc auth v4191 DIRTYPARENT s=65791 n(v0 rc2024-04-21T12:03:56.550134+0000 b65791 1=1+0) caps={4753=pAsLsXsFsc/-@2},l=4753(-1) | ptrwaiter=0 request=0 lock=0 caps=1 dirtyparent=1 dirty=0 authpin=0 0x558d7a5ba680]

The parent dir appears to be frozen:

2024-04-21T12:04:11.603+0000 7fbb769c7640 10 mds.0.cache.dir(0x10000002fe1) _freeze_dir [dir 0x10000002fe1 /client.0/tmp/ceph/src/ [2,head] auth v=5484 cv=5322/5322 ap=1+0 state=1610895377|complete|freezingdir|fragmenting|dnpinnedfrag f(v0 m2024-04-21T12:04:09.612818+0000 103=51+52) n(v4 rc2024-04-21T12:04:10.288801+0000 b23531464 3168=2900+268)/n(v4 rc2024-04-21T12:04:09.612818+0000 b23531464 3065=2799+266) hs=103+2,ss=0+0 dirty=2 | child=1 dirty=1 waiter=1 authpin=1 0x558d780fe400]

Related issues 1 (1 open0 closed)

Copied to CephFS - Backport #65738: squid: mds: quiesce timeout due to a freezing directoryIn ProgressPatrick DonnellyActions
Actions

Also available in: Atom PDF