Bug #53597

mds: FAILED ceph_assert(dir->get_projected_version() == dir->get_version())

Added by 玮文 胡 over 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Community (user)
Tags: backport_processed
Backport: quincy, pacific
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

# ceph crash info 2021-12-13T17:07:59.644235Z_674d6c2a-ec54-4bf3-a040-2a53ba7f93fe
{
    "assert_condition": "dir->get_projected_version() == dir->get_version()",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/mds/Migrator.cc",
    "assert_func": "void Migrator::encode_export_dir(ceph::bufferlist&, CDir*, std::map<client_t, entity_inst_t>&, std::map<client_t, client_metadata_t>&, uint64_t&)",
    "assert_line": 1753,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/mds/Migrator.cc: In function 'void Migrator::encode_export_dir(ceph::bufferlist&, CDir*, std::map<client_t, entity_inst_t>&, std::map<client_t, client_metadata_t>&, uint64_t&)' thread 7f31ea44e700 time 2021-12-13T17:07:59.638997+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/mds/Migrator.cc: 1753: FAILED ceph_assert(dir->get_projected_version() == dir->get_version())\n",
    "assert_thread_name": "MR_Finisher",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f31f7e6fb20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f31f8e7ed1f]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x276ee8) [0x7f31f8e7eee8]",
        "(Migrator::encode_export_dir(ceph::buffer::v15_2_0::list&, CDir*, std::map<client_t, entity_inst_t, std::less<client_t>, std::allocator<std::pair<client_t const, entity_inst_t> > >&, std::map<client_t, client_metadata_t, std::less<client_t>, std::allocator<std::pair<client_t const, client_metadata_t> > >&, unsigned long&)+0xbce) [0x55bca1101f4e]",
        "(Migrator::export_go_synced(CDir*, unsigned long)+0x52d) [0x55bca11024dd]",
        "(C_M_ExportGo::finish(int)+0x19) [0x55bca1128689]",
        "(MDSContext::complete(int)+0x56) [0x55bca1209906]",
        "(C_IO_Wrapper::finish(int)+0x12) [0x55bca120a622]",
        "(MDSContext::complete(int)+0x56) [0x55bca1209906]",
        "(MDSIOContextBase::complete(int)+0x5ac) [0x55bca120a13c]",
        "(C_IO_Wrapper::complete(int)+0x12d) [0x55bca120a56d]",
        "(Finisher::finisher_thread_entry()+0x1a5) [0x7f31f8f1e6d5]",
        "/lib64/libpthread.so.0(+0x814a) [0x7f31f7e6514a]",
        "clone()" 
    ],
    "ceph_version": "16.2.6",
    "crash_id": "2021-12-13T17:07:59.644235Z_674d6c2a-ec54-4bf3-a040-2a53ba7f93fe",
    "entity_name": "mds.cephfs.gpu006.ddpekw",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mds",
    "stack_sig": "3f92007b85dc9e8d2220c46c5d3cfa748d5a1e634cd303ccd3e2bfc96ce02b3f",
    "timestamp": "2021-12-13T17:07:59.644235Z",
    "utsname_hostname": "gpu006",
    "utsname_machine": "x86_64",
    "utsname_release": "5.8.0-55-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#62~20.04.1-Ubuntu SMP Wed Jun 2 08:55:04 UTC 2021" 
}

This happens while we are trying to upgrade from 16.2.6 to 16.2.7 with cephadm: while max_mds is being reduced to 1, rank 1 repeatedly crashes with this stack trace, and the upgrade cannot proceed.

For now we have paused the upgrade and reset max_mds to 2, which at least stops the crash loop.

mds.log.gz - logs of crash after setting debug_mds to 1/20 (334 KB) 玮文 胡, 12/14/2021 01:57 AM


Related issues

Related to CephFS - Bug #62381: mds: Bug still exists: FAILED ceph_assert(dir->get_projected_version() == dir->get_version()) In Progress
Copied to CephFS - Backport #55928: quincy: mds: FAILED ceph_assert(dir->get_projected_version() == dir->get_version()) Resolved
Copied to CephFS - Backport #55929: pacific: mds: FAILED ceph_assert(dir->get_projected_version() == dir->get_version()) Resolved

History

#1 Updated by 玮文 胡 over 2 years ago

Attached the logs after setting debug_mds to 1/20. This may be the most interesting part:

debug    -11> 2021-12-14T01:45:40.325+0000 7f12c1c1c700  7 mds.1.mig export_go_synced [dir 0x200065b309d ~mds0/stray1/200065b309d/ [2,head] auth{0=2} v=1 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f() n() hs=0+0,ss=0+0 | ptrwaiter=1 request=0 frozen=1 subtree=1 replicated=1 waiter=0 authpin=0 0x55c0e3e23600] to 0
debug    -10> 2021-12-14T01:45:40.325+0000 7f12c1c1c700 10 mds.1.cache number of subtrees = 190; not printing subtrees
debug     -9> 2021-12-14T01:45:40.325+0000 7f12c1c1c700  7 mds.1.cache adjust_subtree_auth 1,1 -> 1,0 on [dir 0x200065b309d ~mds0/stray1/200065b309d/ [2,head] auth{0=2} v=1 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f() n() hs=0+0,ss=0+0 | ptrwaiter=1 request=0 frozen=1 subtree=1 replicated=1 waiter=0 authpin=0 0x55c0e3e23600]
debug     -8> 2021-12-14T01:45:40.325+0000 7f12c1c1c700 10 mds.1.cache number of subtrees = 190; not printing subtrees
debug     -7> 2021-12-14T01:45:40.325+0000 7f12c1c1c700  7 mds.1.cache  current root is [dir 0x200065b309d ~mds0/stray1/200065b309d/ [2,head] auth{0=2} v=1 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f() n() hs=0+0,ss=0+0 | ptrwaiter=1 request=0 frozen=1 subtree=1 replicated=1 waiter=0 authpin=0 0x55c0e3e23600]
debug     -6> 2021-12-14T01:45:40.325+0000 7f12c1c1c700 10 mds.1.cache.dir(0x200065b309d) setting dir_auth=1,0 from 1,1 on [dir 0x200065b309d ~mds0/stray1/200065b309d/ [2,head] auth{0=2} v=1 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f() n() hs=0+0,ss=0+0 | ptrwaiter=1 request=0 frozen=1 subtree=1 replicated=1 waiter=0 authpin=0 0x55c0e3e23600]
debug     -5> 2021-12-14T01:45:40.325+0000 7f12c1c1c700 10 mds.1.cache number of subtrees = 190; not printing subtrees
debug     -4> 2021-12-14T01:45:40.325+0000 7f12c1c1c700  7 mds.1.mig encode_export_dir [dir 0x200065b309d ~mds0/stray1/200065b309d/ [2,head] auth{0=2} v=1 cv=0/0 dir_auth=1,0 state=1073742851|complete|frozentree|exporting f() n() hs=0+0,ss=0+0 | ptrwaiter=1 request=0 frozen=1 subtree=1 replicated=1 waiter=0 authpin=0 0x55c0e3e23600] 0 head items
debug     -3> 2021-12-14T01:45:40.329+0000 7f12c1c1c700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/mds/Migrator.cc: In function 'void Migrator::encode_export_dir(ceph::bufferlist&, CDir*, std::map<client_t, entity_inst_t>&, std::map<client_t, client_metadata_t>&, uint64_t&)' thread 7f12c1c1c700 time 2021-12-14T01:45:40.329447+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/mds/Migrator.cc: 1753: FAILED ceph_assert(dir->get_projected_version() == dir->get_version())

It looks strange that mds.1 needs to export a stray dir of mds.0, doesn't it?

#2 Updated by Xiubo Li over 2 years ago

Unless I'm missing something important, the system dirs shouldn't be migrated, in theory.

#3 Updated by Xiubo Li over 2 years ago

The dir "200065b309d/" is already located in the stray/, I think it's queued for being purging, will it be possible that the MDS.1 hasn't get a chance to purge it, and need to export it to MDS.0 to do that ?

#4 Updated by 玮文 胡 over 2 years ago

This dir should have been deleted about one month ago. I just found that one client was still holding a cap on it, so I killed the relevant process on that client. The cap is now gone, but the dir still does not go away, and rank 1 will still crash if I try to reduce max_mds to 1.

# ceph fs status
cephfs - 29 clients
======
RANK  STATE           MDS              ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  cephfs.gpu018.ovxvoz  Reqs: 1595 /s  16.9M  16.9M   305k  1768k
 1    active  cephfs.gpu006.ddpekw  Reqs:  433 /s  2085k  1902k   133k   923k
          POOL              TYPE     USED  AVAIL
   cephfs.cephfs.meta     metadata   130G   340G
   cephfs.cephfs.data       data     120T  15.6T
 cephfs.cephfs.data_ssd     data       0    511G
cephfs.cephfs.data_mixed    data    9.81T   340G
 cephfs.cephfs.data_ec      data     171G  25.0T
    STANDBY MDS
cephfs.gpu024.rpfbnh
cephfs.gpu023.aetiph
MDS version: ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
# ceph tell mds.cephfs.gpu018.ovxvoz dump inode 0x200065b309d
{
    "path": "~mds0/stray1/200065b309d",
    "ino": 2199129895069,
    "rdev": 0,
    "ctime": "2021-10-16T12:12:21.305939+0000",
    "btime": "2021-10-14T11:04:36.224144+0000",
    "mode": 16893,
    "uid": 859600034,
    "gid": 859600034,
    "nlink": 0,
    "dir_layout": {
        "dir_hash": 2,
        "unused1": 0,
        "unused2": 0,
        "unused3": 0
    },
    "layout": {
        "stripe_unit": 0,
        "stripe_count": 0,
        "object_size": 0,
        "pool_id": -1,
        "pool_ns": "" 
    },
    "old_pools": [],
    "size": 0,
    "truncate_seq": 1,
    "truncate_size": 18446744073709551615,
    "truncate_from": 0,
    "truncate_pending": 0,
    "mtime": "2021-10-16T12:12:21.305939+0000",
    "atime": "2021-10-14T11:04:36.224144+0000",
    "time_warp_seq": 0,
    "change_attr": 3,
    "export_pin": -1,
    "export_ephemeral_random_pin": 0,
    "export_ephemeral_distributed_pin": false,
    "client_ranges": [],
    "dirstat": {
        "version": 0,
        "mtime": "2021-10-16T12:12:21.305939+0000",
        "num_files": 0,
        "num_subdirs": 0,
        "change_attr": 2
    },
    "rstat": {
        "version": 0,
        "rbytes": 0,
        "rfiles": 0,
        "rsubdirs": 1,
        "rsnaps": 0,
        "rctime": "2021-10-16T12:12:21.305939+0000" 
    },
    "accounted_rstat": {
        "version": 0,
        "rbytes": 0,
        "rfiles": 0,
        "rsubdirs": 1,
        "rsnaps": 0,
        "rctime": "2021-10-16T12:12:21.305939+0000" 
    },
    "version": 82354523,
    "file_data_version": 0,
    "xattr_version": 1,
    "backtrace_version": 82354523,
    "stray_prior_path": "/home/yangyifan/yangyifan/.vscode-server/data/logs/20211014T190435",
    "max_size_ever": 0,
    "quota": {
        "max_bytes": 0,
        "max_files": 0
    },
    "last_scrub_stamp": "0.000000",
    "last_scrub_version": 0,
    "symlink": "",
    "xattrs": [],
    "dirfragtree": {
        "splits": []
    },
    "old_inodes": [],
    "oldest_snap": 18446744073709551614,
    "damage_flags": 0,
    "is_auth": true,
    "auth_state": {
        "replicas": {}
    },
    "replica_state": {
        "authority": [
            0,
            -2
        ],
        "replica_nonce": 0
    },
    "auth_pins": 0,
    "is_frozen": false,
    "is_freezing": false,
    "pins": {
        "request": 0,
        "lock": 0,
        "importingcaps": 0,
        "importing": 0,
        "dirfrag": 0,
        "caps": 0,
        "stickydirs": 0,
        "dirtyparent": 0,
        "dirwaiter": 0,
        "replicated": 0,
        "dirty": 0,
        "authpin": 0,
        "discoverbase": 0
    },
    "nref": 0,
    "versionlock": {
        "gather_set": [],
        "state": "lock",
        "is_leased": false,
        "num_rdlocks": 0,
        "num_wrlocks": 0,
        "num_xlocks": 0,
        "xlock_by": {}
    },
    "authlock": {},
    "linklock": {},
    "dirfragtreelock": {},
    "filelock": {},
    "xattrlock": {},
    "snaplock": {},
    "nestlock": {
        "gather_set": [],
        "state": "mix",
        "is_leased": false,
        "num_rdlocks": 0,
        "num_wrlocks": 0,
        "num_xlocks": 0,
        "xlock_by": {}
    },
    "flocklock": {},
    "policylock": {},
    "states": [
        "auth" 
    ],
    "client_caps": [],
    "loner": -1,
    "want_loner": 7977667,
    "mds_caps_wanted": []
}

# ceph tell mds.cephfs.gpu006.ddpekw dump inode 0x200065b309d
dump inode failed, wrong inode number or the inode is not cached

Let me know if there is anything else worth checking.

#5 Updated by Xiubo Li over 2 years ago

Are you using the fuse client or the kclient?

#6 Updated by 玮文 胡 over 2 years ago

Xiubo Li wrote:

Are you using the fuse client or the kclient?

Both. But I believe only the kclient has ever accessed this inode (0x200065b309d).

Actually, we are not using fuse; we use the cephfs Python binding to export some `cephfs.dir.rbytes` values to our Prometheus. I assume this behaves similarly to fuse.
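
For context, here is a minimal sketch of what such an exporter does, assuming the `cephfs` Python binding (python3-cephfs) is installed and the ceph.conf/keyring allow mounting; the directory path is hypothetical, and the recursive size is exposed by the MDS as the virtual xattr ceph.dir.rbytes:

import cephfs

# Sketch only: the conf path and directory below are assumptions, not from this report.
fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()
try:
    # ceph.dir.rbytes is the recursive byte count the MDS maintains for the tree.
    rbytes = int(fs.getxattr('/shared/projects', 'ceph.dir.rbytes'))
    print(rbytes)  # e.g. set a Prometheus gauge from this value
finally:
    fs.shutdown()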

#7 Updated by Xiubo Li over 2 years ago

玮文 胡 wrote:

This dir should have been deleted about one month ago. I just found that one client was still holding a cap on it, so I killed the relevant process on that client. The cap is now gone, but the dir still does not go away, and rank 1 will still crash if I try to reduce max_mds to 1.

BTW, what caps did that client hold?

#8 Updated by 玮文 胡 over 2 years ago

Xiubo Li wrote:

玮文 胡 wrote:

This dir should have been deleted about one month ago. I just found that one client was still holding a cap on it, so I killed the relevant process on that client. The cap is now gone, but the dir still does not go away, and rank 1 will still crash if I try to reduce max_mds to 1.

BTW, what caps did that client hold?

This diff shows what changed after I killed that process:

--- old.out     2021-12-14 06:55:53.588552894 +0000
+++ now.out     2021-12-14 06:56:58.869758404 +0000
@@ -98,7 +98,7 @@
         "importingcaps": 0,
         "importing": 0,
         "dirfrag": 0,
-        "caps": 1,
+        "caps": 0,
         "stickydirs": 0,
         "dirtyparent": 0,
         "dirwaiter": 0,
@@ -107,7 +107,7 @@
         "authpin": 0,
         "discoverbase": 0
     },
-    "nref": 1,
+    "nref": 0,
     "versionlock": {
         "gather_set": [],
         "state": "lock",
@@ -137,16 +137,8 @@
     "states": [
         "auth" 
     ],
-    "client_caps": [
-        {
-            "client_id": 7977667,
-            "pending": "pAsLsXs",
-            "issued": "pAsLsXs",
-            "wanted": "-",
-            "last_sent": 4
-        }
-    ],
-    "loner": 7977667,
+    "client_caps": [],
+    "loner": -1,
     "want_loner": 7977667,
     "mds_caps_wanted": []
 }

(Luckily I still had that console open, so I could retrieve the earlier output from the scrollback.)

#9 Updated by 玮文 胡 over 2 years ago

There is one thing that looks strange to me: it is rank 1 that wants to export the inode to rank 0, but when I issue "dump inode" on rank 1, it reports that the inode is not in its cache, while the inode is in the cache of rank 0.

So if the inode is not in rank 1's cache, it cannot purge the inode, right? Is there any way to force rank 1 to load this inode into its cache to give it a chance to purge the inode again? Or at least a way to check what this inode looks like from rank 1's point of view with "dump inode"?

#10 Updated by 玮文 胡 over 2 years ago

The inode 0x200065b309d is gone now; I don't know how. But I got another inode that crashes rank 1. It is very similar to the previous one:

--- old.out     2021-12-14 06:55:53.588552894 +0000
+++ old2.out    2021-12-14 17:12:25.868239901 +0000
@@ -1,12 +1,12 @@
 {
-    "path": "~mds0/stray1/200065b309d",
-    "ino": 2199129895069,
+    "path": "~mds0/stray1/20006fdf4cf",
+    "ino": 2199140562127,
     "rdev": 0,
-    "ctime": "2021-10-16T12:12:21.305939+0000",
-    "btime": "2021-10-14T11:04:36.224144+0000",
+    "ctime": "2021-11-13T07:40:12.204675+0000",
+    "btime": "2021-11-11T02:44:08.380744+0000",
     "mode": 16893,
-    "uid": 859600034,
-    "gid": 859600034,
+    "uid": 859600023,
+    "gid": 859600023,
     "nlink": 0,
     "dir_layout": {
         "dir_hash": 2,
@@ -27,8 +27,8 @@
     "truncate_size": 18446744073709551615,
     "truncate_from": 0,
     "truncate_pending": 0,
-    "mtime": "2021-10-16T12:12:21.305939+0000",
-    "atime": "2021-10-14T11:04:36.224144+0000",
+    "mtime": "2021-11-13T07:40:12.200675+0000",
+    "atime": "2021-11-11T02:44:08.380744+0000",
     "time_warp_seq": 0,
     "change_attr": 3,
     "export_pin": -1,
@@ -37,7 +37,7 @@
     "client_ranges": [],
     "dirstat": {
         "version": 0,
-        "mtime": "2021-10-16T12:12:21.305939+0000",
+        "mtime": "2021-11-13T07:40:12.200675+0000",
         "num_files": 0,
         "num_subdirs": 0,
         "change_attr": 2
@@ -48,7 +48,7 @@
         "rfiles": 0,
         "rsubdirs": 1,
         "rsnaps": 0,
-        "rctime": "2021-10-16T12:12:21.305939+0000" 
+        "rctime": "2021-11-13T07:40:12.204675+0000" 
     },
     "accounted_rstat": {
         "version": 0,
@@ -56,13 +56,13 @@
         "rfiles": 0,
         "rsubdirs": 1,
         "rsnaps": 0,
-        "rctime": "2021-10-16T12:12:21.305939+0000" 
+        "rctime": "2021-11-13T07:40:12.204675+0000" 
     },
-    "version": 82354523,
+    "version": 82361701,
     "file_data_version": 0,
     "xattr_version": 1,
-    "backtrace_version": 82354523,
-    "stray_prior_path": "/home/yangyifan/yangyifan/.vscode-server/data/logs/20211014T190435",
+    "backtrace_version": 82361701,
+    "stray_prior_path": "/home/dengzeshuai/.vscode-server/data/logs/20211111T104407",
     "max_size_ever": 0,
     "quota": {
         "max_bytes": 0,
@@ -139,14 +139,14 @@
     ],
     "client_caps": [
         {
-            "client_id": 7977667,
+            "client_id": 8094446,
             "pending": "pAsLsXs",
             "issued": "pAsLsXs",
             "wanted": "-",
             "last_sent": 4
         }
     ],
-    "loner": 7977667,
-    "want_loner": 7977667,
+    "loner": 8094446,
+    "want_loner": 8094446,
     "mds_caps_wanted": []
 }

It is also related to the vscode server, but from a different user, and the cap is held by a different client. To help debug this issue, is there anything you want me to check before I kill the process that holds this cap?

#11 Updated by 玮文 胡 over 2 years ago

BTW, is there any way to traverse all the inodes in the stray dirs, so that I can find all such stale caps in one pass? "ceph tell mds.cephfs.gpu018.ovxvoz dump tree '~mds0/stray1'" returns "root inode is not in cache".

#12 Updated by 玮文 胡 over 2 years ago

Though almost identical, here are the logs before the crash in the new case.

debug     -9> 2021-12-14T17:06:19.332+0000 7f64edb19700  7 mds.1.mig export_go_synced [dir 0x20006fdf4cf ~mds0/stray1/20006fdf4cf/ [2,head] auth{0=2} v=1 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f() n() hs=0+0,ss=0+0 | ptrwaiter=1 request=0 frozen=1 subtree=1 replicated=1 waiter=0 authpin=0 0x56453aa0f600] to 0
debug     -8> 2021-12-14T17:06:19.332+0000 7f64edb19700 10 mds.1.cache number of subtrees = 124; not printing subtrees
debug     -7> 2021-12-14T17:06:19.332+0000 7f64edb19700  7 mds.1.cache adjust_subtree_auth 1,1 -> 1,0 on [dir 0x20006fdf4cf ~mds0/stray1/20006fdf4cf/ [2,head] auth{0=2} v=1 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f() n() hs=0+0,ss=0+0 | ptrwaiter=1 request=0 frozen=1 subtree=1 replicated=1 waiter=0 authpin=0 0x56453aa0f600]
debug     -6> 2021-12-14T17:06:19.332+0000 7f64edb19700 10 mds.1.cache number of subtrees = 124; not printing subtrees
debug     -5> 2021-12-14T17:06:19.332+0000 7f64edb19700  7 mds.1.cache  current root is [dir 0x20006fdf4cf ~mds0/stray1/20006fdf4cf/ [2,head] auth{0=2} v=1 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f() n() hs=0+0,ss=0+0 | ptrwaiter=1 request=0 frozen=1 subtree=1 replicated=1 waiter=0 authpin=0 0x56453aa0f600]
debug     -4> 2021-12-14T17:06:19.332+0000 7f64edb19700 10 mds.1.cache.dir(0x20006fdf4cf) setting dir_auth=1,0 from 1,1 on [dir 0x20006fdf4cf ~mds0/stray1/20006fdf4cf/ [2,head] auth{0=2} v=1 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f() n() hs=0+0,ss=0+0 | ptrwaiter=1 request=0 frozen=1 subtree=1 replicated=1 waiter=0 authpin=0 0x56453aa0f600]
debug     -3> 2021-12-14T17:06:19.332+0000 7f64edb19700 10 mds.1.cache number of subtrees = 124; not printing subtrees
debug     -2> 2021-12-14T17:06:19.332+0000 7f64edb19700  7 mds.1.mig encode_export_dir [dir 0x20006fdf4cf ~mds0/stray1/20006fdf4cf/ [2,head] auth{0=2} v=1 cv=0/0 dir_auth=1,0 state=1073742851|complete|frozentree|exporting f() n() hs=0+0,ss=0+0 | ptrwaiter=1 request=0 frozen=1 subtree=1 replicated=1 waiter=0 authpin=0 0x56453aa0f600] 0 head items
debug     -1> 2021-12-14T17:06:19.332+0000 7f64edb19700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/mds/Migrator.cc: In function 'void Migrator::encode_export_dir(ceph::bufferlist&, CDir*, std::map<client_t, entity_inst_t>&, std::map<client_t, client_metadata_t>&, uint64_t&)' thread 7f64edb19700 time 2021-12-14T17:06:19.334858+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/mds/Migrator.cc: 1753: FAILED ceph_assert(dir->get_projected_version() == dir->get_version())

#13 Updated by Xiubo Li over 2 years ago

玮文 胡 wrote:

The inode 0x200065b309d is gone now; I don't know how. But I got another inode that crashes rank 1. It is very similar to the previous one:

I think it can take a while to expire and trim the mdlog segments; AFAIK, the 0x200065b309d entry in the stray/ dir will block the mdlog segments from being expired and trimmed.

BTW, could you share your latest `ceph fs status` output?

[...]

It is also related to the vscode server, but from a different user, and the cap is held by a different client. To help debug this issue, is there anything you want me to check before I kill the process that holds this cap?

Yeah, could you show me the output of `rados -p cephfs.cephfs.meta ls`? And, for all the active MDSes, the output of `./bin/cephfs-journal-tool --rank=cephfs.cephfs:0 header get`?

You can try to flush the journal for that MDS and wait for a while to check whether it goes away. Please run the above commands again after you kill the process.

#14 Updated by Xiubo Li over 2 years ago

玮文 胡 wrote:

BTW, is there any way to traverse all the inodes in the stray dirs, so that I can find all such stale caps in one pass? "ceph tell mds.cephfs.gpu018.ovxvoz dump tree '~mds0/stray1'" returns "root inode is not in cache".

You should use the following command:

ceph daemon mds.cephfs.gpu018.ovxvoz dump tree ~mdsdir 3

Or

ceph tell mds.cephfs.gpu018.ovxvoz dump tree '~mdsdir/stray1'

It's a little strange that the code couldn't recognize '~mds0/', which is what is shown by `get subtrees`.
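
To list such entries in one pass, here is a rough, hypothetical sketch; it assumes `dump tree` returns a JSON array of inode dumps shaped like the `dump inode` output above (adjust the MDS name and depth as needed):

import json
import subprocess

# Dump the ~mdsdir subtree (depth 3 should reach strayN/<ino>) and print
# stray entries that are still pinned by client caps.
out = subprocess.run(
    ["ceph", "tell", "mds.cephfs.gpu018.ovxvoz", "dump", "tree", "~mdsdir", "3"],
    capture_output=True, text=True, check=True,
).stdout
for ino in json.loads(out):
    caps = ino.get("client_caps") or []
    if caps:
        print(ino.get("path"), [c["client_id"] for c in caps])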

#15 Updated by 玮文 胡 over 2 years ago

Xiubo Li wrote:

BTW, could you share your latest `ceph fs status` output?

I don't quite understand this; the output doesn't change much over time.

# ceph fs status
cephfs - 29 clients
======
RANK  STATE           MDS              ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  cephfs.gpu018.ovxvoz  Reqs:  111 /s  16.3M  16.3M   347k  4336k
 1    active  cephfs.gpu006.ddpekw  Reqs:   23 /s  3941k  3539k   257k  1300k
          POOL              TYPE     USED  AVAIL
   cephfs.cephfs.meta     metadata   133G   334G
   cephfs.cephfs.data       data     123T  14.0T
 cephfs.cephfs.data_ssd     data       0    501G
cephfs.cephfs.data_mixed    data    9.81T   334G
 cephfs.cephfs.data_ec      data     178G  22.4T
    STANDBY MDS
cephfs.gpu024.rpfbnh
cephfs.gpu023.aetiph
MDS version: ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)

Yeah, could you show me the output of `rados -p cephfs.cephfs.meta ls`?

That's a very long list (7425661 rows); I'm afraid it does not fit here. What are you looking for?

# rados -p cephfs.cephfs.meta ls > cephfs.meta.ls
# grep 20006fdf4cf cephfs.meta.ls
20006fdf4cf.00000000
# grep -v -E '[0-9a-f]+\.[0-9a-f]{8}' cephfs.meta.ls
mds1_openfiles.b
mds0_openfiles.1e
mds0_openfiles.3
mds0_openfiles.1d
mds0_openfiles.19
mds0_openfiles.6
mds0_openfiles.b
mds1_openfiles.2
mds0_openfiles.2c
mds0_openfiles.9
mds0_openfiles.4
mds0_openfiles.28
mds0_openfiles.25
mds0_openfiles.11
mds0_openfiles.22
mds0_openfiles.1a
mds0_openfiles.1f
mds0_openfiles.2f
mds1_openfiles.0
mds0_openfiles.15
mds1_openfiles.8
mds1_openfiles.c
mds0_openfiles.0
mds1_openfiles.e
mds0_openfiles.26
mds0_openfiles.c
mds0_openfiles.10
mds0_openfiles.f
mds0_openfiles.20
mds0_openfiles.a
mds0_openfiles.2a
mds1_openfiles.1
mds0_openfiles.24
mds1_openfiles.9
mds0_openfiles.8
mds0_openfiles.14
mds0_openfiles.1c
mds0_openfiles.1b
mds0_openfiles.29
mds1_openfiles.4
mds0_openfiles.e
mds_snaptable
mds0_openfiles.23
mds0_openfiles.16
mds0_openfiles.17
mds0_openfiles.18
mds0_openfiles.7
mds1_sessionmap
mds0_openfiles.12
mds0_inotable
mds0_openfiles.1
mds0_openfiles.d
mds0_openfiles.2
mds0_sessionmap
mds1_openfiles.3
mds0_openfiles.21
mds0_openfiles.2d
mds0_openfiles.2b
mds1_openfiles.7
mds1_inotable
mds1_openfiles.d
mds1_openfiles.6
mds1_openfiles.5
mds0_openfiles.27
mds0_openfiles.5
mds0_openfiles.13
mds0_openfiles.2e
mds1_openfiles.a

And, for all the active MDSes, the output of `./bin/cephfs-journal-tool --rank=cephfs.cephfs:0 header get`?

# cephfs-journal-tool --rank=cephfs:0 header get
{
    "magic": "ceph fs volume v011",
    "write_pos": 6590283530699,
    "expire_pos": 6590043545004,
    "trimmed_pos": 6590035918848,
    "stream_format": 1,
    "layout": {
        "stripe_unit": 4194304,
        "stripe_count": 1,
        "object_size": 4194304,
        "pool_id": 4,
        "pool_ns": "" 
    }
}
# cephfs-journal-tool --rank=cephfs:1 header get
{
    "magic": "ceph fs volume v011",
    "write_pos": 2888050916279,
    "expire_pos": 2887870458493,
    "trimmed_pos": 2887857995776,
    "stream_format": 1,
    "layout": {
        "stripe_unit": 4194304,
        "stripe_count": 1,
        "object_size": 4194304,
        "pool_id": 4,
        "pool_ns": "" 
    }
}

You can try to flush the journal for that MDS and wait for a while to check whether it goes away.

# ceph tell mds.cephfs.gpu018.ovxvoz flush journal
{
    "message": "",
    "return_code": 0
}
# ceph tell mds.cephfs.gpu006.ddpekw flush journal
{
    "message": "",
    "return_code": 0
}
# cephfs-journal-tool --rank=cephfs:0 header get
{
    "magic": "ceph fs volume v011",
    "write_pos": 6590290964693,
    "expire_pos": 6590288918674,
    "trimmed_pos": 6590287577088,
    "stream_format": 1,
    "layout": {
        "stripe_unit": 4194304,
        "stripe_count": 1,
        "object_size": 4194304,
        "pool_id": 4,
        "pool_ns": "" 
    }
}
# cephfs-journal-tool --rank=cephfs:1 header get
{
    "magic": "ceph fs volume v011",
    "write_pos": 2888690565284,
    "expire_pos": 2888514323589,
    "trimmed_pos": 2888495529984,
    "stream_format": 1,
    "layout": {
        "stripe_unit": 4194304,
        "stripe_count": 1,
        "object_size": 4194304,
        "pool_id": 4,
        "pool_ns": "" 
    }
}

The above is the output from before I cleared the cap, so as expected, the inode 20006fdf4cf is still there.

Please run the above commands again after you kill the process.

Now I have killed the process. The cap is gone as expected.

# ceph tell mds.cephfs.gpu018.ovxvoz flush journal
{
    "message": "",
    "return_code": 0
}
# ceph tell mds.cephfs.gpu006.ddpekw flush journal
{
    "message": "",
    "return_code": 0
}
# cephfs-journal-tool --rank=cephfs:0 header get
{
    "magic": "ceph fs volume v011",
    "write_pos": 6590306115467,
    "expire_pos": 6590304978010,
    "trimmed_pos": 6590304354304,
    "stream_format": 1,
    "layout": {
        "stripe_unit": 4194304,
        "stripe_count": 1,
        "object_size": 4194304,
        "pool_id": 4,
        "pool_ns": "" 
    }
}
# cephfs-journal-tool --rank=cephfs:1 header get
{
    "magic": "ceph fs volume v011",
    "write_pos": 2890949195703,
    "expire_pos": 2890813814352,
    "trimmed_pos": 2890810785792,
    "stream_format": 1,
    "layout": {
        "stripe_unit": 4194304,
        "stripe_count": 1,
        "object_size": 4194304,
        "pool_id": 4,
        "pool_ns": "" 
    }
}

After that I waited for 10 minutes. The inode (0x20006fdf4cf) is still there.

#16 Updated by Xiubo Li over 2 years ago

玮文 胡 wrote:

Xiubo Li wrote:

BTW, could you share your latest `ceph fs status` output?

I don't quite understand this; the output doesn't change much over time.
[...]

Yeah, could you show me the output of `rados -p cephfs.cephfs.meta ls`?

That's a very long list (7425661 rows); I'm afraid it does not fit here. What are you looking for?

With the above outputs I just wanted to rule out another known issue where the metadata pool fills up with journal logs. The above is enough, and that is not the cause in this case.

[...]

And, for all the active MDSes, the output of `./bin/cephfs-journal-tool --rank=cephfs.cephfs:0 header get`?

[...]

You can try to flush the journal for that MDS and wait for a while to check whether it goes away.

[...]

The above is the output from before I cleared the cap, so as expected, the inode 20006fdf4cf is still there.

Please run the above commands again after you kill the process.

Now I have killed the process. The cap is gone as expected.

[...]

After that I waited for 10 minutes. The inode (0x20006fdf4cf) is still there.

Do you have the logs from when the last inode disappeared?

#17 Updated by 玮文 胡 over 2 years ago

Xiubo Li wrote:

Do you have the logs from when the last inode disappeared?

No, I only have debug_mds set to 1/5. I can try to capture some logs this time. Would 5/5 be enough?

#18 Updated by Xiubo Li over 2 years ago

玮文 胡 wrote:

Xiubo Li wrote:

Do you have the logs from when the last inode disappeared?

No, I only have debug_mds set to 1/5. I can try to capture some logs this time. Would 5/5 be enough?

20/1 would be better if possible, or 15/1.

#19 Updated by 玮文 胡 over 2 years ago

Do you have the logs from when the last inode disappeared?

I got the log of inode 0x20006fdf4cf being purged. Log level: debug_mds 15/5, filtered with `grep -5 -E 'trim_lru|20006fdf4cf' <log filename>`:

2021-12-15T13:41:36.293+0000 7fa42e959700 10 mds.0.cache.dir(0x600) auth_pin by 0x564db802df80 on [dir 0x600 ~mds0/stray0/ [2,head] auth{1=2} v=82496944 cv=82495210/82495210 ap=3+0 state=1610883072|sticky f(v10 m2021-12-15T13:41:09.493978+0000 6511=6506+5)/f(v10 m2021-12-15T13:41:09.493978+0000 6594=6587+7) n(v41 rc2021-12-15T13:41:09.493978+0000 b951929601 6511=6506+5)/n(v41 rc2021-12-15T13:41:09.493978+0000 b4402296790 6594=6587+7) hs=41+1,ss=0+0 dirty=8 | child=1 sticky=1 replicated=1 dirty=1 waiter=0 authpin=1 0x564dd072fa80] count now 3
2021-12-15T13:41:36.293+0000 7fa42e959700 10 mds.0.cache.strays purge [dentry #0x100/stray0/1000f9e1c03 [2,head] auth (dversion lock) v=82496775 ino=0x1000f9e1c03 state=1879048245|new | request=0 lock=0 inodepin=1 purging=1 dirty=1 authpin=0 0x5656addcd180] [inode 0x1000f9e1c03 [...2,head] ~mds0/stray0/1000f9e1c03/ auth v82496775 f(v0 m2021-12-15T13:41:09.445977+0000) n(v29 rc2021-12-15T13:41:09.493978+0000 1=0+1) (inest lock) (iversion lock) | dirtyscattered=0 request=0 lock=0 dirfrag=0 caps=0 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=0 0x564ed21ad180]
2021-12-15T13:41:36.293+0000 7fa42e959700  4 mds.0.purge_queue push: pushing inode 0x1000f9e1c03
2021-12-15T13:41:36.293+0000 7fa42e959700 10 mds.0.purge_queue _consume:  not readable right now
2021-12-15T13:41:36.293+0000 7fa42e959700 10 mds.0.cache.strays enqueue: purging this dentry immediately: [dentry #0x100/stray0/1000f9e1c03 [2,head] auth (dversion lock) v=82496775 ino=0x1000f9e1c03 state=1879048245|new | request=0 lock=0 inodepin=1 purging=1 dirty=1 authpin=0 0x5656addcd180]
2021-12-15T13:41:36.293+0000 7fa42e959700  7 mds.0.cache trim_lru trimming 0 items from LRU size=13015269 mid=7130069 pintail=525772 pinned=2829454
2021-12-15T13:41:36.293+0000 7fa42e959700 12 mds.0.cache trim_dentry [dentry #0x1/mixed/dataset/scannet/scans/scene0505_01/color/686.jpg [2,head] auth (dversion lock) v=34347 ino=0x10006dbb3e9 state=1073741824 0x565a5da3b680]
2021-12-15T13:41:36.293+0000 7fa42e959700 12 mds.0.cache  in container [dir 0x100005c2aab /mixed/dataset/ [2,head] auth v=1104797 cv=1104797/1104797 REP dir_auth=0 state=1074528256|auxsubtree f(v3 m2021-09-07T12:10:40.218007+0000 9=0+9) n(v46288 rc2021-12-14T10:35:32.438471+0000 b3504289605088 12929495=12863378+66117) hs=2+0,ss=0+0 | child=1 subtree=1 importbound=0 exportbound=0 sticky=0 subtreetemp=0 replicated=0 dirty=0 waiter=0 authpin=0 0x564db80b9200]
2021-12-15T13:41:36.293+0000 7fa42e959700 15 mds.0.cache trim_inode [inode 0x10006dbb3e9 [2,head] /mixed/dataset/scannet/scans/scene0505_01/color/686.jpg auth v34347 s=80141 n(v0 rc2021-12-14T09:51:41.503248+0000 b80141 1=1+0) (iversion lock) 0x5654f7654580]
2021-12-15T13:41:36.293+0000 7fa42e959700 12 mds.0.cache.dir(0x10006dbb13a) unlink_inode [dentry #0x1/mixed/dataset/scannet/scans/scene0505_01/color/686.jpg [2,head] auth (dversion lock) v=34347 ino=0x10006dbb3e9 state=1073741824 0x565a5da3b680] [inode 0x10006dbb3e9 [2,head] /mixed/dataset/scannet/scans/scene0505_01/color/686.jpg auth v34347 s=80141 n(v0 rc2021-12-14T09:51:41.503248+0000 b80141 1=1+0) (iversion lock) 0x5654f7654580]
2021-12-15T13:41:36.293+0000 7fa42e959700 14 mds.0.cache remove_inode [inode 0x10006dbb3e9 [2,head] #10006dbb3e9 auth v34347 s=80141 n(v0 rc2021-12-14T09:51:41.503248+0000 b80141 1=1+0) (iversion lock) 0x5654f7654580]
--
2021-12-15T13:41:40.081+0000 7fa42e959700 15 mds.0.cache trim_inode [inode 0x2000637af0b [...2,head] ~mds0/stray1/2000637af0b auth v82361699 snaprealm=0x564dce08d800 s=1269 n(v0 rc2021-10-28T13:55:55.994743+0000 b1269 1=1+0) (iversion lock) 0x565078d00b00]
2021-12-15T13:41:40.081+0000 7fa42e959700 12 mds.0.cache.dir(0x601) unlink_inode [dentry #0x100/stray1/2000637af0b [2,head] auth (dversion lock) v=82361699 ino=0x2000637af0b state=1073741824 0x564fda1aa000] [inode 0x2000637af0b [...2,head] ~mds0/stray1/2000637af0b auth v82361699 snaprealm=0x564dce08d800 s=1269 n(v0 rc2021-10-28T13:55:55.994743+0000 b1269 1=1+0) (iversion lock) 0x565078d00b00]
2021-12-15T13:41:40.081+0000 7fa42e959700 14 mds.0.cache remove_inode [inode 0x2000637af0b [...2,head] #2000637af0b auth v82361699 snaprealm=0x564dce08d800 s=1269 n(v0 rc2021-10-28T13:55:55.994743+0000 b1269 1=1+0) (iversion lock) 0x565078d00b00]
2021-12-15T13:41:40.081+0000 7fa42e959700 15 mds.0.cache.ino(0x2000637af0b) close_snaprealm snaprealm(0x2000637af0b seq 1 lc 0 cr 1 cps 2 snaps={} past_parent_snaps= global  0x564dce08d800)
2021-12-15T13:41:40.081+0000 7fa42e959700 12 mds.0.cache.dir(0x601) remove_dentry [dentry #0x100/stray1/2000637af0b [2,head] auth NULL (dversion lock) v=82361699 ino=(nil) state=1073741824 0x564fda1aa000]
2021-12-15T13:41:40.081+0000 7fa42e959700 12 mds.0.cache trim_dentry [dentry #0x100/stray1/20006fdf4cf [2,head] auth (dversion lock) v=82361701 ino=0x20006fdf4cf state=1073741824 0x5656b7019b80]
2021-12-15T13:41:40.081+0000 7fa42e959700 12 mds.0.cache  in container [dir 0x100 ~mds0/ [2,head] auth{1=7} v=168523061 cv=168522877/168522877 dir_auth=0 state=1610612737|complete f(v0 10=0+10) n(v56788 rc2021-12-15T13:41:09.493978+0000 b25102366891 23993=23942+51) hs=10+0,ss=0+0 dirty=1 | child=1 subtree=1 subtreetemp=0 replicated=1 dirty=1 waiter=0 authpin=0 0x564db80b8000]
2021-12-15T13:41:40.081+0000 7fa42e959700 15 mds.0.cache trim_inode [inode 0x20006fdf4cf [...2,head] ~mds0/stray1/20006fdf4cf/ auth v82361701 f(v0 m2021-11-13T07:40:12.200675+0000) n(v0 rc2021-11-13T07:40:12.204675+0000 1=0+1) (inest mix) (iversion lock) 0x5653ac080100]
2021-12-15T13:41:40.081+0000 7fa42e959700 15 mds.0.cache trim_dirfrag [dir 0x20006fdf4cf ~mds0/stray1/20006fdf4cf/ [2,head] auth v=1 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x565cb2034880]
2021-12-15T13:41:40.081+0000 7fa42e959700 14 mds.0.cache.ino(0x20006fdf4cf) close_dirfrag *
2021-12-15T13:41:40.081+0000 7fa42e959700 12 mds.0.cache.dir(0x20006fdf4cf) remove_null_dentries [dir 0x20006fdf4cf ~mds0/stray1/20006fdf4cf/ [2,head] auth v=1 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x565cb2034880]
2021-12-15T13:41:40.081+0000 7fa42e959700 10 mds.0.cache.strays eval_stray [dentry #0x100/stray1/20006fdf4cf [2,head] auth (dversion lock) v=82361701 ino=0x20006fdf4cf state=1073741840 0x5656b7019b80]
2021-12-15T13:41:40.081+0000 7fa42e959700 10 mds.0.cache.strays  inode is [inode 0x20006fdf4cf [...2,head] ~mds0/stray1/20006fdf4cf/ auth v82361701 f(v0 m2021-11-13T07:40:12.200675+0000) n(v0 rc2021-11-13T07:40:12.204675+0000 1=0+1) (inest mix) (iversion lock) 0x5653ac080100]
2021-12-15T13:41:40.081+0000 7fa42e959700 10 mds.0.cache clear_dirty_bits_for_stray [inode 0x20006fdf4cf [...2,head] ~mds0/stray1/20006fdf4cf/ auth v82361701 f(v0 m2021-11-13T07:40:12.200675+0000) n(v0 rc2021-11-13T07:40:12.204675+0000 1=0+1) (inest mix) (iversion lock) 0x5653ac080100]
2021-12-15T13:41:40.081+0000 7fa42e959700 10 mds.0.cache.dir(0x601) auth_pin by 0x564db802df80 on [dir 0x601 ~mds0/stray1/ [2,head] auth{1=60} v=82363724 cv=82363326/82363326 ap=1+0 state=1610883072|sticky f(v10 m2021-12-15T12:29:00.246584+0000 4994=4990+4)/f(v10 m2021-12-15T12:29:00.246584+0000 5092=5085+7) n(v46 rc2021-12-15T12:29:00.246584+0000 b308661142 4994=4990+4)/n(v46 rc2021-12-15T12:29:00.246584+0000 b3299691802 5092=5085+7) hs=86+3,ss=0+0 dirty=3 | child=1 sticky=1 replicated=1 dirty=1 waiter=0 authpin=1 0x564dd072f600] count now 1
2021-12-15T13:41:40.081+0000 7fa42e959700 10 mds.0.cache.strays purge [dentry #0x100/stray1/20006fdf4cf [2,head] auth (dversion lock) v=82361701 ino=0x20006fdf4cf state=1073741876 | request=0 lock=0 inodepin=0 purging=1 replicated=0 dirty=0 authpin=0 0x5656b7019b80] [inode 0x20006fdf4cf [...2,head] ~mds0/stray1/20006fdf4cf/ auth v82361701 f(v0 m2021-11-13T07:40:12.200675+0000) n(v0 rc2021-11-13T07:40:12.204675+0000 1=0+1) (inest mix) (iversion lock) 0x5653ac080100]
2021-12-15T13:41:40.081+0000 7fa42e959700  4 mds.0.purge_queue push: pushing inode 0x20006fdf4cf
2021-12-15T13:41:40.081+0000 7fa42e959700 10 mds.0.purge_queue _consume:  not readable right now
2021-12-15T13:41:40.081+0000 7fa42e959700 10 mds.0.cache.strays enqueue: purging this dentry immediately: [dentry #0x100/stray1/20006fdf4cf [2,head] auth (dversion lock) v=82361701 ino=0x20006fdf4cf state=1073741876 | request=0 lock=0 inodepin=0 purging=1 replicated=0 dirty=0 authpin=0 0x5656b7019b80]
2021-12-15T13:41:40.081+0000 7fa42e959700 12 mds.0.cache trim_dentry [dentry #0x100/stray1/200035fcc1c [2,head] auth (dversion lock) v=82361635 ino=0x200035fcc1c state=1073741824 0x5650ae9e3b80]
2021-12-15T13:41:40.081+0000 7fa42e959700 12 mds.0.cache  in container [dir 0x100 ~mds0/ [2,head] auth{1=7} v=168523061 cv=168522877/168522877 dir_auth=0 state=1610612737|complete f(v0 10=0+10) n(v56788 rc2021-12-15T13:41:09.493978+0000 b25102366891 23993=23942+51) hs=10+0,ss=0+0 dirty=1 | child=1 subtree=1 subtreetemp=0 replicated=1 dirty=1 waiter=0 authpin=0 0x564db80b8000]
2021-12-15T13:41:40.081+0000 7fa42e959700 15 mds.0.cache trim_inode [inode 0x200035fcc1c [...2,head] ~mds0/stray1/200035fcc1c auth v82361635 snaprealm=0x56521187de00 s=37 nl=2 n(v0 rc2021-10-15T12:36:22.120603+0000 b37 1=1+0) (iversion lock) 0x5658d7c99600]
2021-12-15T13:41:40.081+0000 7fa42e959700 12 mds.0.cache.dir(0x601) unlink_inode [dentry #0x100/stray1/200035fcc1c [2,head] auth (dversion lock) v=82361635 ino=0x200035fcc1c state=1073741824 0x5650ae9e3b80] [inode 0x200035fcc1c [...2,head] ~mds0/stray1/200035fcc1c auth v82361635 snaprealm=0x56521187de00 s=37 nl=2 n(v0 rc2021-10-15T12:36:22.120603+0000 b37 1=1+0) (iversion lock) 0x5658d7c99600]
2021-12-15T13:41:40.081+0000 7fa42e959700 14 mds.0.cache remove_inode [inode 0x200035fcc1c [...2,head] #200035fcc1c auth v82361635 snaprealm=0x56521187de00 s=37 nl=2 n(v0 rc2021-10-15T12:36:22.120603+0000 b37 1=1+0) (iversion lock) 0x5658d7c99600]
--
2021-12-15T13:41:49.305+0000 7fa42e959700 12 mds.0.cache  in container [dir 0x100005c2aab /mixed/dataset/ [2,head] auth v=1104797 cv=1104797/1104797 REP dir_auth=0 state=1074528256|auxsubtree f(v3 m2021-09-07T12:10:40.218007+0000 9=0+9) n(v46288 rc2021-12-14T10:35:32.438471+0000 b3504289605088 12929495=12863378+66117) hs=2+0,ss=0+0 | child=1 subtree=1 importbound=0 exportbound=0 sticky=0 subtreetemp=0 replicated=0 dirty=0 waiter=0 authpin=0 0x564db80b9200]
2021-12-15T13:41:49.305+0000 7fa42e959700 15 mds.0.cache trim_inode [inode 0x100074504bc [2,head] /mixed/dataset/scannet/scans/scene0435_02/depth/1572.png auth v14215 s=61234 n(v0 rc2021-12-14T09:53:06.568774+0000 b61234 1=1+0) (iversion lock) 0x565649836b00]
2021-12-15T13:41:49.305+0000 7fa42e959700 12 mds.0.cache.dir(0x1000744b4dc) unlink_inode [dentry #0x1/mixed/dataset/scannet/scans/scene0435_02/depth/1572.png [2,head] auth (dversion lock) v=14215 ino=0x100074504bc state=1073741824 0x56567889b900] [inode 0x100074504bc [2,head] /mixed/dataset/scannet/scans/scene0435_02/depth/1572.png auth v14215 s=61234 n(v0 rc2021-12-14T09:53:06.568774+0000 b61234 1=1+0) (iversion lock) 0x565649836b00]
2021-12-15T13:41:49.305+0000 7fa42e959700 14 mds.0.cache remove_inode [inode 0x100074504bc [2,head] #100074504bc auth v14215 s=61234 n(v0 rc2021-12-14T09:53:06.568774+0000 b61234 1=1+0) (iversion lock) 0x565649836b00]
2021-12-15T13:41:49.305+0000 7fa42e959700 12 mds.0.cache.dir(0x1000744b4dc) remove_dentry [dentry #0x1/mixed/dataset/scannet/scans/scene0435_02/depth/1572.png [2,head] auth NULL (dversion lock) v=14215 ino=(nil) state=1073741824 0x56567889b900]
2021-12-15T13:41:49.305+0000 7fa42e959700  7 mds.0.cache trim_lru trimmed 219536 items
2021-12-15T13:41:49.305+0000 7fa42e959700  7 mds.0.cache sending cache_expire to 1
2021-12-15T13:41:49.305+0000 7fa42e959700  7 mds.0.server recall_client_state: min=100 max=8388608 total=3624721 flags=0xe
2021-12-15T13:41:49.305+0000 7fa42e959700 10 mds.0.server recall_client_state: session client.7976747 v1:202.38.247.226:0/1930470818 caps 1285242, leases 9
2021-12-15T13:41:49.305+0000 7fa42e959700  7 mds.0.server   recalling 30000 caps; session_recall_throttle = 2; global_recall_throttle = 56
2021-12-15T13:41:49.305+0000 7fa42e959700 10 mds.0.54445 send_message_client client.7976747 v1:202.38.247.226:0/1930470818 client_session(recall_state max_caps 1255242 max_leases 0) v4
--
2021-12-15T13:41:49.357+0000 7fa42d957700 10 mds.0.cache.den(0x600 1000f9e1c03) pre_dirty [dentry #0x100/stray0/1000f9e1c03 [2,head] auth (dversion lock) pv=82496949 v=82496775 ino=0x1000f9e1c03 state=1879048229|new | request=0 lock=0 inodepin=1 purging=1 dirty=1 authpin=0 0x5656addcd180]
2021-12-15T13:41:49.357+0000 7fa42d957700 10 mds.0.cache.dir(0x600) project_fnode 0x56591af88020
2021-12-15T13:41:49.357+0000 7fa42d957700 10 mds.0.cache.dir(0x600) pre_dirty 82496950
2021-12-15T13:41:49.357+0000 7fa42d957700 10 MDSIOContextBase::complete: 21C_IO_PurgeStrayPurged
2021-12-15T13:41:49.357+0000 7fa42d957700 10 MDSContext::complete: 21C_IO_PurgeStrayPurged
2021-12-15T13:41:49.357+0000 7fa42d957700 10 mds.0.cache.strays _purge_stray_purged [dentry #0x100/stray1/20006fdf4cf [2,head] auth (dversion lock) v=82361701 ino=0x20006fdf4cf state=1073741860 | request=0 lock=0 inodepin=0 purging=1 replicated=0 dirty=0 authpin=0 0x5656b7019b80] [inode 0x20006fdf4cf [...2,head] ~mds0/stray1/20006fdf4cf/ auth v82361701 f(v0 m2021-11-13T07:40:12.200675+0000) n(v0 rc2021-11-13T07:40:12.204675+0000 1=0+1) (inest mix) (iversion lock) 0x5653ac080100]
2021-12-15T13:41:49.357+0000 7fa42d957700 10 mds.0.cache.dir(0x601) pre_dirty 82363725
2021-12-15T13:41:49.357+0000 7fa42d957700 10 mds.0.cache.den(0x601 20006fdf4cf) pre_dirty [dentry #0x100/stray1/20006fdf4cf [2,head] auth (dversion lock) pv=82363725 v=82361701 ino=0x20006fdf4cf state=1073741860 | request=0 lock=0 inodepin=0 purging=1 replicated=0 dirty=0 authpin=0 0x5656b7019b80]
2021-12-15T13:41:49.357+0000 7fa42d957700 10 mds.0.cache.dir(0x601) project_fnode 0x565abeb16c80
2021-12-15T13:41:49.357+0000 7fa42d957700 10 mds.0.cache.dir(0x601) pre_dirty 82363726
2021-12-15T13:41:49.357+0000 7fa42d957700 10 mds.0.purge_queue _commit_ops: PURGE_FILE
2021-12-15T13:41:49.357+0000 7fa42c955700 10 MDSContext::complete: 18C_Locker_ScatterWB
2021-12-15T13:41:49.357+0000 7fa42b953700  5 mds.0.log _submit_thread 6596071845421~579 : EUpdate purge_stray [metablob 0x600, 1 dirs]
2021-12-15T13:41:49.357+0000 7fa42b953700  5 mds.0.log _submit_thread 6596071846020~579 : EUpdate purge_stray [metablob 0x601, 1 dirs]
2021-12-15T13:41:49.357+0000 7fa42d957700 10 mds.0.purge_queue _execute_item:  remove dirfrag 1000f9e1c04.00000000
2021-12-15T13:41:49.357+0000 7fa42c955700 10 mds.0.locker scatter_writebehind_finish on (inest lock->sync w=1 dirty flushing) on [inode 0x1000888c883 [...2,head] /dataset/VLN-CE/result/jidongyu/ auth v3872403 pv3872405 ap=1 f(v0 m2021-10-26T11:42:26.351487+0000 36=0+36) n(v141832 rc2021-12-15T13:41:25.088058+0000 b7631235525888 10350=8400+1950) (isnap sync r=4) (inest lock->sync w=1 dirty flushing) (ipolicy sync r=4) (iversion lock) caps={7794103=pAsLsXsFs/-@2,7794480=pAsLsXsFs/-@2,7854301=pAsLsXsFs/-@2,7976767=pAsLsXsFs/AsLsXsFsx@3,8094446=pAsLsXsFs/-@16,8115205=pAsLsXsFs/AsLsXsFsx@2,8134535=pAsLsXsFs/-@2} | dirtyscattered=1 request=0 lock=3 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=0 authpin=1 0x564db9a0d180]
2021-12-15T13:41:49.357+0000 7fa42d957700 10 mds.0.purge_queue _execute_item:  remove dirfrag 1000f9e1c03.00000000
2021-12-15T13:41:49.357+0000 7fa42c955700 15 mds.0.cache.ino(0x1000888c882) pop_and_dirty_projected_inode v3611392
2021-12-15T13:41:49.357+0000 7fa42d957700 10 mds.0.purge_queue _execute_item:  remove dirfrag 20006fdf4cf.00000000
2021-12-15T13:41:49.357+0000 7fa42d957700 10 mds.0.purge_queue _consume:  not readable right now
2021-12-15T13:41:49.357+0000 7fa42d957700 10 mds.0.purge_queue _commit_ops: PURGE_DIR
2021-12-15T13:41:49.357+0000 7fa42c955700 10 mds.0.cache.ino(0x1000888c882) mark_dirty [inode 0x1000888c882 [...2,head] /dataset/VLN-CE/result/ auth v3611392 ap=2 f(v0 m2021-09-14T13:15:33.993729+0000 3=0+3) n(v208168 rc2021-12-15T13:41:25.088058+0000 b8183526684874 12513=10162+2351)/n(v208168 rc2021-12-15T13:41:25.088058+0000 b8183526683964 12513=10162+2351) (isnap sync r=4) (inest lock->sync w=1 dirty) (ipolicy sync r=4) (iversion lock w=1) caps={7794103=pAsLsXsFs/-@2,7794480=pAsLsXsFs/-@2,7854301=pAsLsXsFs/-@2,7976767=pAsLsXsFs/AsLsXsFsx@5,8048885=pAsLsXsFs/-@2,8094446=pAsLsXsFs/-@2,8115205=pAsLsXsFs/AsLsXsFsx@3,8134535=pAsLsXsFs/-@2} | dirtyscattered=1 request=0 lock=4 dirfrag=1 caps=1 dirtyrstat=1 dirty=1 waiter=0 authpin=1 0x564db9a0cc00]
2021-12-15T13:41:49.357+0000 7fa42c955700 10 mds.0.cache.den(0x100082cbcdc result) mark_dirty [dentry #0x1/dataset/VLN-CE/result [2,head] auth (dversion lock) pv=3611392 v=3611390 ino=0x1000888c882 state=1610612736 | request=0 lock=0 inodepin=1 dirty=1 authpin=0 0x564db9a10000]
2021-12-15T13:41:49.357+0000 7fa42d957700 10 mds.0.purge_queue _commit_ops: PURGE_DIR
--
2021-12-15T13:41:49.377+0000 7fa432961700 10 mds.0.locker simple_eval (iauth sync) on [inode 0x100009ff779 [2,head] /mixed/dataset/imagenet/train/n01698640/n01698640_21991.JPEG auth v6082 s=151207 n(v0 rc2020-11-13T11:54:03.028456+0000 b151207 1=1+0) (iversion lock) caps={7976747=pAsLsXsFscr/pFscr@1},l=7976747 | request=0 lock=0 caps=1 authpin=0 0x565816747700]
2021-12-15T13:41:49.377+0000 7fa432961700 10 mds.0.locker simple_eval (ilink sync) on [inode 0x100009ff779 [2,head] /mixed/dataset/imagenet/train/n01698640/n01698640_21991.JPEG auth v6082 s=151207 n(v0 rc2020-11-13T11:54:03.028456+0000 b151207 1=1+0) (iversion lock) caps={7976747=pAsLsXsFscr/pFscr@1},l=7976747 | request=0 lock=0 caps=1 authpin=0 0x565816747700]
2021-12-15T13:41:49.377+0000 7fa42d957700 10 mds.0.purge_queue _execute_item_complete: complete at 0x17ee994b5
2021-12-15T13:41:49.377+0000 7fa432961700 10 mds.0.locker simple_eval (ixattr sync) on [inode 0x100009ff779 [2,head] /mixed/dataset/imagenet/train/n01698640/n01698640_21991.JPEG auth v6082 s=151207 n(v0 rc2020-11-13T11:54:03.028456+0000 b151207 1=1+0) (iversion lock) caps={7976747=pAsLsXsFscr/pFscr@1},l=7976747 | request=0 lock=0 caps=1 authpin=0 0x565816747700]
2021-12-15T13:41:49.377+0000 7fa42d957700 10 mds.0.purge_queue _execute_item_complete: expiring to 0x17ee994b5
2021-12-15T13:41:49.377+0000 7fa42d957700 10 mds.0.purge_queue _execute_item_complete: completed item for ino 0x20006fdf4cf
2021-12-15T13:41:49.377+0000 7fa42d957700 10 mds.0.purge_queue _execute_item_complete: in_flight.size() now 0
2021-12-15T13:41:49.377+0000 7fa432961700 10 mds.0.locker scatter_eval (inest lock) on [inode 0x100009ff779 [2,head] /mixed/dataset/imagenet/train/n01698640/n01698640_21991.JPEG auth v6082 s=151207 n(v0 rc2020-11-13T11:54:03.028456+0000 b151207 1=1+0) (iversion lock) caps={7976747=pAsLsXsFscr/pFscr@1},l=7976747 | request=0 lock=0 caps=1 authpin=0 0x565816747700]
2021-12-15T13:41:49.377+0000 7fa42d957700 10 mds.0.purge_queue _execute_item_complete: left purge items in journal: 0 (purge_item_journal_size/write_pos/read_pos/expire_pos) now at (101/6424204469/6424204469/6424204469)
2021-12-15T13:41:49.377+0000 7fa42d957700 10 mds.0.purge_queue _consume:  not readable right now
2021-12-15T13:41:49.377+0000 7fa432961700 10 mds.0.locker simple_eval (iflock sync) on [inode 0x100009ff779 [2,head] /mixed/dataset/imagenet/train/n01698640/n01698640_21991.JPEG auth v6082 s=151207 n(v0 rc2020-11-13T11:54:03.028456+0000 b151207 1=1+0) (iversion lock) caps={7976747=pAsLsXsFscr/pFscr@1},l=7976747 | request=0 lock=0 caps=1 authpin=0 0x565816747700]
--
2021-12-15T13:41:49.389+0000 7fa42c955700 10 mds.0.cache.ino(0x1000f9e1c03) mark_clean [inode 0x1000f9e1c03 [...2,head] #1000f9e1c03/ auth v82496775 f(v0 m2021-12-15T13:41:09.445977+0000) n(v29 rc2021-12-15T13:41:09.493978+0000 1=0+1) (inest lock) (iversion lock) | dirtyscattered=0 request=0 lock=0 dirfrag=0 caps=0 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=0 0x564ed21ad180]
2021-12-15T13:41:49.389+0000 7fa42c955700 14 mds.0.cache remove_inode [inode 0x1000f9e1c03 [...2,head] #1000f9e1c03/ auth v82496775 f(v0 m2021-12-15T13:41:09.445977+0000) n(v29 rc2021-12-15T13:41:09.493978+0000 1=0+1) (inest lock) (iversion lock) 0x564ed21ad180]
2021-12-15T13:41:49.389+0000 7fa42c955700 10 mds.0.cache.dir(0x600) auth_unpin by 0x564db802df80 on [dir 0x600 ~mds0/stray0/ [2,head] auth{1=2} v=82496950 cv=82495210/82495210 state=1610883072|sticky f(v10 m2021-12-15T13:41:09.493978+0000 6508=6505+3)/f(v10 m2021-12-15T13:41:09.493978+0000 6594=6587+7) n(v41 rc2021-12-15T13:41:09.493978+0000 b951735469 6508=6505+3)/n(v41 rc2021-12-15T13:41:09.493978+0000 b4402296790 6594=6587+7) hs=38+1,ss=0+0 dirty=5 | child=1 sticky=1 replicated=1 dirty=1 waiter=0 authpin=0 0x564dd072fa80] count now 0
2021-12-15T13:41:49.389+0000 7fa42c955700 10 MDSIOContextBase::complete: 18C_PurgeStrayLogged
2021-12-15T13:41:49.389+0000 7fa42c955700 10 MDSContext::complete: 18C_PurgeStrayLogged
2021-12-15T13:41:49.389+0000 7fa42c955700 10 mds.0.cache.strays _purge_stray_logged [dentry #0x100/stray1/20006fdf4cf [2,head] auth (dversion lock) pv=82363725 v=82361701 ino=0x20006fdf4cf state=1073741860 | request=0 lock=0 inodepin=0 purging=1 replicated=0 dirty=0 authpin=0 0x5656b7019b80] [inode 0x20006fdf4cf [...2,head] ~mds0/stray1/20006fdf4cf/ auth v82361701 f(v0 m2021-11-13T07:40:12.200675+0000) n(v0 rc2021-11-13T07:40:12.204675+0000 1=0+1) (inest mix) (iversion lock) 0x5653ac080100]
2021-12-15T13:41:49.389+0000 7fa42c955700 12 mds.0.cache.dir(0x601) unlink_inode [dentry #0x100/stray1/20006fdf4cf [2,head] auth (dversion lock) pv=82363725 v=82361701 ino=0x20006fdf4cf state=1073741860 | request=0 lock=0 inodepin=0 purging=1 replicated=0 dirty=0 authpin=0 0x5656b7019b80] [inode 0x20006fdf4cf [...2,head] ~mds0/stray1/20006fdf4cf/ auth v82361701 f(v0 m2021-11-13T07:40:12.200675+0000) n(v0 rc2021-11-13T07:40:12.204675+0000 1=0+1) (inest mix) (iversion lock) 0x5653ac080100]
2021-12-15T13:41:49.389+0000 7fa42c955700 10 mds.0.cache.den(0x601 20006fdf4cf) mark_dirty [dentry #0x100/stray1/20006fdf4cf [2,head] auth NULL (dversion lock) pv=82363725 v=82361701 ino=(nil) state=1073741924|bottomlru | request=0 lock=0 inodepin=0 purging=1 replicated=0 dirty=0 authpin=0 0x5656b7019b80]
2021-12-15T13:41:49.389+0000 7fa42c955700 10 mds.0.cache.dir(0x601) _mark_dirty (already dirty) [dir 0x601 ~mds0/stray1/ [2,head] auth{1=60} pv=82363726 v=82363724 cv=82363326/82363326 ap=1+0 state=1610883072|sticky f(v10 m2021-12-15T12:29:00.246584+0000 4994=4990+4)/f(v10 m2021-12-15T12:29:00.246584+0000 5092=5085+7) n(v46 rc2021-12-15T12:29:00.246584+0000 b308661142 4994=4990+4)/n(v46 rc2021-12-15T12:29:00.246584+0000 b3299691802 5092=5085+7) hs=14+4,ss=0+0 dirty=4 | child=1 sticky=1 replicated=1 dirty=1 waiter=0 authpin=1 0x564dd072f600] version 82363724
2021-12-15T13:41:49.389+0000 7fa42c955700 15 mds.0.cache.dir(0x601) pop_and_dirty_projected_fnode 0x565abeb16c80 v82363726
2021-12-15T13:41:49.389+0000 7fa42c955700 10 mds.0.cache.dir(0x601) _mark_dirty (already dirty) [dir 0x601 ~mds0/stray1/ [2,head] auth{1=60} v=82363726 cv=82363326/82363326 ap=1+0 state=1610883072|sticky f(v10 m2021-12-15T12:29:00.246584+0000 4993=4990+3)/f(v10 m2021-12-15T12:29:00.246584+0000 5092=5085+7) n(v46 rc2021-12-15T12:29:00.246584+0000 b308661142 4993=4990+3)/n(v46 rc2021-12-15T12:29:00.246584+0000 b3299691802 5092=5085+7) hs=14+4,ss=0+0 dirty=4 | child=1 sticky=1 replicated=1 dirty=1 waiter=0 authpin=1 0x564dd072f600] version 82363726
2021-12-15T13:41:49.389+0000 7fa42c955700 14 mds.0.cache remove_inode [inode 0x20006fdf4cf [...2,head] #20006fdf4cf/ auth v82361701 f(v0 m2021-11-13T07:40:12.200675+0000) n(v0 rc2021-11-13T07:40:12.204675+0000 1=0+1) (inest mix) (iversion lock) 0x5653ac080100]
2021-12-15T13:41:49.389+0000 7fa42c955700 10 mds.0.cache.dir(0x601) auth_unpin by 0x564db802df80 on [dir 0x601 ~mds0/stray1/ [2,head] auth{1=60} v=82363726 cv=82363326/82363326 state=1610883072|sticky f(v10 m2021-12-15T12:29:00.246584+0000 4993=4990+3)/f(v10 m2021-12-15T12:29:00.246584+0000 5092=5085+7) n(v46 rc2021-12-15T12:29:00.246584+0000 b308661142 4993=4990+3)/n(v46 rc2021-12-15T12:29:00.246584+0000 b3299691802 5092=5085+7) hs=14+4,ss=0+0 dirty=4 | child=1 sticky=1 replicated=1 dirty=1 waiter=0 authpin=0 0x564dd072f600] count now 0
2021-12-15T13:41:49.389+0000 7fa42c955700 10 MDSIOContextBase::complete: 18C_Locker_ScatterWB
2021-12-15T13:41:49.389+0000 7fa42c955700 10 MDSContext::complete: 18C_Locker_ScatterWB
2021-12-15T13:41:49.389+0000 7fa42c955700 10 mds.0.locker scatter_writebehind_finish on (inest lock->sync w=1 dirty flushing) on [inode 0x1000888c882 [...2,head] /dataset/VLN-CE/result/ auth v3611392 pv3611394 ap=1 f(v0 m2021-09-14T13:15:33.993729+0000 3=0+3) n(v208168 rc2021-12-15T13:41:25.088058+0000 b8183526684874 12513=10162+2351)/n(v208168 rc2021-12-15T13:41:25.088058+0000 b8183526683964 12513=10162+2351) (isnap sync r=4) (inest lock->sync w=1 dirty flushing) (ipolicy sync r=4) (iversion lock) caps={7794103=pAsLsXsFs/-@2,7794480=pAsLsXsFs/-@2,7854301=pAsLsXsFs/-@2,7976767=pAsLsXsFs/AsLsXsFsx@5,8048885=pAsLsXsFs/-@2,8094446=pAsLsXsFs/-@2,8115205=pAsLsXsFs/AsLsXsFsx@3,8134535=pAsLsXsFs/-@2} | dirtyscattered=1 request=0 lock=3 dirfrag=1 caps=1 dirtyrstat=1 dirty=1 waiter=0 authpin=1 0x564db9a0cc00]
2021-12-15T13:41:49.389+0000 7fa42c955700 15 mds.0.cache.ino(0x1000888c882) pop_and_dirty_projected_inode v3611394

No relevant log from rank 1 was found.

I just gradually reduced the "mds_cache_memory_limit" of rank 0 to accelerate `trim_lru`, so this inode was purged by the regular LRU trim.

Now rank 1 has crashed on yet another inode, 0x20006a5f2f9 at ~mds0/stray3/20006a5f2f9/.

With the `dump tree` command you posted, I roughly understand what happened.

Take inode 0x20006a5f2f9 as an example. This dir is deleted, but some client still holds a cap on it, so it cannot be purged. Before I reduced max_mds, it was located at ~mds1/stray3/20006a5f2f9 (I know this from the saved `dump tree` result). Once I set max_mds to 1, mds.1 renamed it into ~mds0/stray3. The rename itself seems to succeed, but just after that, the export goes wrong.

I also see many slow rename ops while rank 1 is stopping, which may or may not be related to this issue. I checked some of the inodes. They have `nlink > 0`. Some examples:

[WRN] slow request 263.643426 seconds old, received at 2021-12-15T14:09:20.236388+0000: client_request(mds.1:246 rename #0x602/2000150b00f #0x60c/2000150b00f caller_uid=0, caller_gid=0{}) currently acquired locks
[WRN] slow request 92.115930 seconds old, received at 2021-12-15T14:15:45.872428+0000: client_request(mds.1:7567 rename #0x603/2000541e067 #0x60d/2000541e067 caller_uid=0, caller_gid=0{}) currently failed to xlock, waiting

#20 Updated by Xiubo Li over 2 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to 玮文 胡
  • Pull request ID set to 44335

#21 Updated by Xiubo Li over 2 years ago

  • Affected Versions v17.0.0 added
  • Affected Versions deleted (v16.2.6)

#22 Updated by Xiubo Li over 2 years ago

  • Backport set to pacific

#23 Updated by Xiubo Li over 2 years ago

  • Priority changed from Normal to High

#24 Updated by Patrick Donnelly about 2 years ago

  • Subject changed from FAILED ceph_assert(dir->get_projected_version() == dir->get_version()) to mds: FAILED ceph_assert(dir->get_projected_version() == dir->get_version())
  • Target version set to v17.0.0

#25 Updated by Venky Shankar almost 2 years ago

  • Category set to Correctness/Safety
  • Status changed from Fix Under Review to Pending Backport
  • Target version changed from v17.0.0 to v18.0.0
  • Backport changed from pacific to quincy, pacific

#26 Updated by Backport Bot almost 2 years ago

  • Copied to Backport #55928: quincy: mds: FAILED ceph_assert(dir->get_projected_version() == dir->get_version()) added

#27 Updated by Backport Bot almost 2 years ago

  • Copied to Backport #55929: pacific: mds: FAILED ceph_assert(dir->get_projected_version() == dir->get_version()) added

#28 Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed

#29 Updated by Xiubo Li over 1 year ago

  • Status changed from Pending Backport to Resolved

#30 Updated by Igor Fedotov 8 months ago

  • Related to Bug #62381: mds: Bug still exists: FAILED ceph_assert(dir->get_projected_version() == dir->get_version()) added
