Actions
Bug #65603
openmds: quiesce timeout due to a freezing directory
Status:
Pending Backport
Priority:
High
Assignee:
Category:
Correctness/Safety
Target version:
% Done:
0%
Source:
Q/A
Tags:
backport_processed
Backport:
squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
MDS, quiesce
Labels (FS):
qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Analyzing one of the ETIMEDOUT error for a quiesce, looking at
https://pulpito.ceph.com/leonidus-2024-04-21_11:37:13-fs-wip-lusov-quiescer-distro-default-smithi/7666603/
The problem was with the following quiesce request:
2024-04-21T12:04:11.496 INFO:teuthology.orchestra.run.smithi148.stderr:2024-04-21T12:04:11.493+0000 7f49bd39f740 4 client.5370 mds_command: new command op to 4282 tid=0 multi_id=0 [{"prefix": "quiesce db", "roots": ["/"], "timeout": 60.0, "expiration": 90.0, "await": true}] 2024-04-21T12:04:11.496 INFO:teuthology.orchestra.run.smithi148.stderr:2024-04-21T12:04:11.493+0000 7f49bd39f740 1 -- 172.21.15.148:0/714775596 --> [v2:172.21.15.157:6833/2409369305,v1:172.21.15.157:6835/2409369305] -- command(tid 0: {"prefix": "quiesce db", "roots": ["/"], "timeout": 60.0, "expiration": 90.0, "await": true}) -- 0x556bb10c3980 con 0x556bb116fbc0 2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stderr:2024-04-21T12:05:11.495+0000 7f49bd39f740 1 -- 172.21.15.148:0/714775596 shutdown_connections 2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stderr:2024-04-21T12:05:11.495+0000 7f49bd39f740 1 -- 172.21.15.148:0/714775596 wait complete. 2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stderr:2024-04-21T12:05:11.495+0000 7f49bd39f740 20 client.5370 trim_cache size 0 max 16384 2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stdout:{ 2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stdout: "epoch": 9, 2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stdout: "leader": 4282, 2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stdout: "set_version": 22, 2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stdout: "sets": { 2024-04-21T12:05:11.496 INFO:teuthology.orchestra.run.smithi148.stdout: "75b48996": { 2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout: "version": 22, 2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout: "age_ref": 0.0, 2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout: "state": { 2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout: "name": "TIMEDOUT", 2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout: "age": 0.0 2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout: }, 2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout: "timeout": 60.0, 2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout: "expiration": 90.0, 2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout: "members": { 2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout: "file:/": { 2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout: "excluded": false, 2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout: "state": { 2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout: "name": "QUIESCING", 2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout: "age": 60.0 2024-04-21T12:05:11.497 INFO:teuthology.orchestra.run.smithi148.stdout: } 2024-04-21T12:05:11.498 INFO:teuthology.orchestra.run.smithi148.stdout: } 2024-04-21T12:05:11.498 INFO:teuthology.orchestra.run.smithi148.stdout: } 2024-04-21T12:05:11.498 INFO:teuthology.orchestra.run.smithi148.stdout: } 2024-04-21T12:05:11.498 INFO:teuthology.orchestra.run.smithi148.stdout: } 2024-04-21T12:05:11.498 INFO:teuthology.orchestra.run.smithi148.stdout:} 2024-04-21T12:05:11.498 DEBUG:teuthology.orchestra.run:got remote process result: 110 2024-04-21T12:05:11.499 INFO:teuthology.orchestra.run.smithi148.stderr:Error ETIMEDOUT: 2024-04-21T12:05:11.499 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't quiesce root with rc: -110 (ETIMEDOUT), stdout:
Extracting the information from the logs and deducing incomplete requests yields:
[ 0x1] since 2024-04-21 12:04:11.493000: / [0x10000000000] since 2024-04-21 12:04:11.494000: /client.0/ [0x100000017b0] since 2024-04-21 12:04:11.494000: /client.0/tmp/ [0x100000017b1] since 2024-04-21 12:04:11.495000: /client.0/tmp/ceph/ [0x10000002fe1] since 2024-04-21 12:04:11.496000: /client.0/tmp/ceph/src/ [0x10000003733] since 2024-04-21 12:04:11.504000: /client.0/tmp/ceph/src/libcephfs.cc [0x10000003167] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_common.sh [0x100000035ca] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/dmclock/ [0x10000003043] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/btrfs_ioc_test.c [0x1000000305a] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph [0x10000002fe7] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/TODO [0x10000003049] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph [0x10000003419] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/crimson/ [0x100000036ca] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/isa [0x10000003041] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/blkin/ [0x10000003a99] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mgr/ [0x10000003b8f] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mypy.ini [0x10000003b9f] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/os/ [0x10000003b8b] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mstart.sh [0x10000003615] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/exporter/ [0x100000036cb] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/jaegertracing/ [0x100000036c9] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/init [0x10000003171] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/cephadm/ [0x10000003047] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph [0x100000035e3] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/erasure [0x10000003726] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/kv/ [0x100000036c8] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/init [0x100000035cb] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/doc/ [0x100000033f6] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/compressor/ [0x10000003b8c] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mstop.sh [0x10000003614] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/etc [0x10000003635] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/googletest/ [0x10000003734] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/libcephsqlite.cc [0x10000003956] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/loadclass.sh [0x10000003b91] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/neorados/ [0x10000003165] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph.conf.twoosds [0x10000003bde] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/osd/ [0x100000036f2] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/journal/ [0x10000003045] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph [0x10000003b1c] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mount.fuse.ceph [0x10000003029] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/bash_completion/ [0x10000003b90] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/nasm [0x10000003597] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/crush/ [0x10000002ff3] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/auth/ [0x10000003725] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/krbd.cc [0x10000003b9b] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ocf/ [0x10000003636] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/include/ [0x10000003168] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_fuse.cc [0x10000003292] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/cls_acl.cc [0x10000003042] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/boost_redis/ [0x10000003b1d] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mount/ [0x1000000304a] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph [0x10000002ff2] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/arrow/ [0x10000003754] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/librbd/ [0x10000003736] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/librados [0x10000003b99] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/objsync/ [0x1000000316c] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_osd.cc [0x10000003294] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/cmonctl [0x1000000374d] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/libradosstriper/ [0x10000003044] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/c [0x1000000395f] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mds/ [0x1000000361e] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/extblkdev/ [0x1000000320a] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/cls/ [0x10000003735] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/libkmip/ [0x1000000316e] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_syn.cc [0x1000000316b] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_mon.cc [0x10000002fe8] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/arch/ [0x10000002fe3] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/CMakeLists.txt [0x1000000370f] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/json_spirit/ [0x1000000305b] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph [0x10000003b24] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mrgw.sh [0x10000003b8e] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mypy [0x1000000371e] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/key_value_store/ [0x10000002fe5] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/SimpleRADOSStriper.cc [0x1000000305d] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph [0x10000003b96] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/objclass/ [0x100000031e3] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ckill.sh [0x1000000316f] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_ver.c [0x10000003ade] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mon/ [0x10000003048] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph [0x1000000316a] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_mgr.cc [0x10000002fe6] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/SimpleRADOSStriper.h [0x100000035c9] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/cstart.sh [0x100000031e4] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/client/ [0x10000002fe4] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/README [0x10000003b26] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/msg/ [0x10000003737] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/librados/ [0x10000003b8d] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/multi [0x10000003169] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_mds.cc [0x1000000302f] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/blk/ [0x10000003295] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/common/ [0x10000003629] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/fmt/ [0x10000002fe2] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/.gitignore [0x1000000305c] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph [0x10000003957] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/log/ [0x10000003166] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph.in [0x10000003293] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/cls_crypto.cc [0x1000000395e] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/logrotate.conf [0x100000039ec] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/messages/ [0x1000000362a] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/global/ [0x100000035b0] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/crypto/ [0x100000036ce] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/java/ [0x100000035da] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/dokan/ [0x10000003b25] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/mrun [0x10000003046] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph [0x10000003059] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph [0x1000000316d] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_release [0x10000003170] since 2024-04-21 12:04:11.505000: /client.0/tmp/ceph/src/ceph_ver.h.in.cmake total pending: 108 complete (min/max/avg/stdev): 6175 (0.0/4.282/0.71/0.46)
We can see that all of the pending quiesces are under @[0x10000002fe1] /client.0/tmp/ceph/src
@. All inodes under the directory complain about the inability to authpin:
2024-04-21T12:04:11.506+0000 7fbb7d1d4640 10 mds.0.locker can't auth_pin (freezing?), waiting to authpin [inode 0x10000003733 [1fa,head] /client.0/tmp/ceph/src/libcephfs.cc auth v4191 DIRTYPARENT s=65791 n(v0 rc2024-04-21T12:03:56.550134+0000 b65791 1=1+0) caps={4753=pAsLsXsFsc/-@2},l=4753(-1) | ptrwaiter=0 request=0 lock=0 caps=1 dirtyparent=1 dirty=0 authpin=0 0x558d7a5ba680]
The parent dir appears to be frozen:
2024-04-21T12:04:11.603+0000 7fbb769c7640 10 mds.0.cache.dir(0x10000002fe1) _freeze_dir [dir 0x10000002fe1 /client.0/tmp/ceph/src/ [2,head] auth v=5484 cv=5322/5322 ap=1+0 state=1610895377|complete|freezingdir|fragmenting|dnpinnedfrag f(v0 m2024-04-21T12:04:09.612818+0000 103=51+52) n(v4 rc2024-04-21T12:04:10.288801+0000 b23531464 3168=2900+268)/n(v4 rc2024-04-21T12:04:09.612818+0000 b23531464 3065=2799+266) hs=103+2,ss=0+0 dirty=2 | child=1 dirty=1 waiter=1 authpin=1 0x558d780fe400]
Actions