Bug #44380
mds: MDCache.cc: 2335: FAILED ceph_assert(!"unmatched rstat rbytes" == g_conf()->mds_verify_scatter)
Status: Closed
Description
2020-02-29T13:29:53.496+0000 7f67c0191700 10 MDSIOContextBase::complete: 13C_MDC_Recover 2020-02-29T13:29:53.496+0000 7f67c0191700 10 MDSContext::complete: 13C_MDC_Recover 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0 RecoveryQueue::_recovered _recovered r=0 size=2 mtime=2020-02-29T13:29:33.859441+0000 for [inode 0x10000000016 [head,head] /subdir/21 auth v356 ap=2 recovering s=2 n() (ifile excl->sync) (iversion lock) cr={5509=0-4194304@fffffffffffffffd} | ptrwaiter=0 request=0 lock=0 caps=0 dirtyrstat=1 dirty=0 authpin=1 0x55e6b547a800] 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.locker check_inode_max_size new_ranges {} update_size 1 on [inode 0x10000000016 [head,head] /subdir/21 auth v356 ap=2 s=2 n() (ifile excl->sync) (iversion lock) cr={5509=0-4194304@fffffffffffffffd} | ptrwaiter=0 request=0 lock=0 caps=0 dirtyrstat=1 dirty=0 authpin=1 0x55e6b547a800] 2020-02-29T13:29:53.496+0000 7f67c0191700 15 mds.0.cache.ino(0x10000000016) project_inode 0x10000000016 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.dir(0x10000000000) pre_dirty 358 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.den(0x10000000000 21) pre_dirty [dentry #0x1/subdir/21 [head,head] auth (dversion lock) pv=358 v=356 ino=0x10000000016 state=1073741824 | lock=0 fragmenting=0 inodepin=1 dirty=0 authpin=0 0x55e6b5478480] 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.ino(0x10000000016) pre_dirty 358 (current v 356) 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.locker check_inode_max_size client_ranges {5509=0-4194304@fffffffffffffffd} -> {} 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.locker check_inode_max_size size 2 -> 2 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.locker check_inode_max_size mtime 2020-02-29T13:29:33.000000+0000 -> 2020-02-29T13:29:33.859441+0000 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache predirty_journal_parents linkunlink=0 primary_dn follows head [inode 0x10000000016 [head,head] /subdir/21 auth v356 pv358 
ap=2 s=2 n()->n(v0 rc2020-02-29T13:29:33.859441+0000 b2)/n() (ifile excl->sync) (iversion lock) cr={5509=0-4194304@fffffffffffffffd} | ptrwaiter=0 request=0 lock=0 caps=0 dirtyrstat=1 dirty=0 authpin=1 0x55e6b547a800] 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.dir(0x10000000000) auth_pin by 0x55e6b54cd440 on [dir 0x10000000000 /subdir/ [2,head] auth v=357 cv=357/357 ap=1+2 state=1073741825|complete f(v0 100=100+0)/f(v0 m2020-02-29T13:29:28.159623+0000 100=100+0) n(v1 rc2020-02-29T13:29:28.160623+0000 b188 99=99+0) hs=100+0,ss=0+0 | child=1 frozen=0 dirty=0 authpin=1 0x55e6b54b2580] count now 1 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.dir(0x10000000000) project_fnode 0x55e6b45e4b10 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.dir(0x10000000000) pre_dirty 359 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache taking wrlock on (inest lock) on [inode 0x10000000000 [...2,head] /subdir/ auth v410 f(v0 m2020-02-29T13:29:28.159623+0000 100=100+0) n(v1 rc2020-02-29T13:29:28.160623+0000 b190 101=100+1) (inest lock) (iversion lock) | dirtyscattered=0 request=0 lock=0 dirfrag=1 caps=0 dirtyrstat=0 dirty=1 waiter=0 authpin=0 0x55e6b5419800] 2020-02-29T13:29:53.496+0000 7f67c0191700 7 mds.0.locker wrlock_force on (inest lock) on [inode 0x10000000000 [...2,head] /subdir/ auth v410 f(v0 m2020-02-29T13:29:28.159623+0000 100=100+0) n(v1 rc2020-02-29T13:29:28.160623+0000 b190 101=100+1) (inest lock) (iversion lock) | dirtyscattered=0 request=0 lock=0 dirfrag=1 caps=0 dirtyrstat=0 dirty=1 waiter=0 authpin=0 0x55e6b5419800] 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.snaprealm(0x1 seq 1 0x55e6b5404500) have_past_parents_open [1,head] 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache projected_rstat_inode_to_frag first head linkunlink 0 [inode 0x10000000016 [head,head] /subdir/21 auth v356 pv358 ap=2 s=2 n()->n(v0 rc2020-02-29T13:29:33.859441+0000 b2)/n() (ifile excl->sync) (iversion lock) 
cr={5509=0-4194304@fffffffffffffffd} | ptrwaiter=0 request=0 lock=0 caps=0 dirtyrstat=1 dirty=0 authpin=1 0x55e6b547a800] 2020-02-29T13:29:53.496+0000 7f67c0191700 20 mds.0.cache frag head is [2,head] 2020-02-29T13:29:53.496+0000 7f67c0191700 20 mds.0.cache inode update is [head,head] 2020-02-29T13:29:53.496+0000 7f67c0191700 20 mds.0.cache floor of head from parent dn [dentry #0x1/subdir/21 [head,head] auth (dversion lock) pv=358 v=356 ino=0x10000000016 state=1073741824 | lock=0 fragmenting=0 inodepin=1 dirty=0 authpin=0 0x55e6b5478480] 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.snaprealm(0x1 seq 1 0x55e6b5404500) have_past_parents_open [1,head] 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.snaprealm(0x1 seq 1 0x55e6b5404500) get_snaps (seq 1 cached_seq 1) 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache _project_rstat_inode_to_frag [head,head] 2020-02-29T13:29:53.496+0000 7f67c0191700 20 mds.0.cache inode rstat n(v0 rc2020-02-29T13:29:33.859441+0000 b2) 2020-02-29T13:29:53.496+0000 7f67c0191700 20 mds.0.cache inode accounted_rstat n() 2020-02-29T13:29:53.496+0000 7f67c0191700 20 mds.0.cache delta n(v0 rc2020-02-29T13:29:33.859441+0000 b2) 2020-02-29T13:29:53.496+0000 7f67c0191700 20 mds.0.cache projecting to head [2,head] n(v1 rc2020-02-29T13:29:28.160623+0000 b188 99=99+0) 2020-02-29T13:29:53.496+0000 7f67c0191700 20 mds.0.cache project to [2,head] n(v1 rc2020-02-29T13:29:28.160623+0000 b188 99=99+0) 2020-02-29T13:29:53.496+0000 7f67c91a3700 1 -- [v2:172.21.15.2:6832/2828792259,v1:172.21.15.2:6833/2828792259] >> v1:172.21.15.40:0/1109131888 conn(0x55e6b53bc000 legacy=0x55e6b541b000 unknown :6833 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 38 2020-02-29T13:29:53.496+0000 7f67c91a3700 1 -- [v2:172.21.15.2:6832/2828792259,v1:172.21.15.2:6833/2828792259] >> v1:172.21.15.40:0/1109131888 conn(0x55e6b53bc000 legacy=0x55e6b541b000 unknown :6833 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed 
2020-02-29T13:29:53.496+0000 7f67c0191700 20 mds.0.cache result [2,head] n(v1 rc2020-02-29T13:29:33.859441+0000 b190 99=99+0) [dir 0x10000000000 /subdir/ [2,head] auth pv=359 v=357 cv=357/357 ap=1+2 state=1073741825|complete f(v0 100=100+0)/f(v0 m2020-02-29T13:29:28.159623+0000 100=100+0)->f(v0 100=100+0)/f(v0 m2020-02-29T13:29:28.159623+0000 100=100+0) n(v1 rc2020-02-29T13:29:28.160623+0000 b188 99=99+0)->n(v1 rc2020-02-29T13:29:33.859441+0000 b190 99=99+0)/n(v1 rc2020-02-29T13:29:28.160623+0000 b188 99=99+0) hs=100+0,ss=0+0 | child=1 frozen=0 dirty=0 authpin=1 0x55e6b54b2580] 2020-02-29T13:29:53.496+0000 7f67c91a3700 1 --1- [v2:172.21.15.2:6832/2828792259,v1:172.21.15.2:6833/2828792259] >> v1:172.21.15.40:0/1109131888 conn(0x55e6b53bc000 0x55e6b541b000 :6833 s=OPENED pgs=2 cs=1 l=1).handle_message read tag failed 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.ino(0x10000000016) clear_dirty_rstat 2020-02-29T13:29:53.496+0000 7f67c91a3700 1 --1- [v2:172.21.15.2:6832/2828792259,v1:172.21.15.2:6833/2828792259] >> v1:172.21.15.40:0/1109131888 conn(0x55e6b53bc000 0x55e6b541b000 :6833 s=OPENED pgs=2 cs=1 l=1).fault on lossy channel, failing 2020-02-29T13:29:53.496+0000 7f67c0191700 7 mds.0.locker local_wrlock_grab on (iversion lock) on [inode 0x10000000000 [...2,head] /subdir/ auth v410 f(v0 m2020-02-29T13:29:28.159623+0000 100=100+0) n(v1 rc2020-02-29T13:29:28.160623+0000 b190 101=100+1) (inest lock w=1) (iversion lock) | dirtyscattered=0 request=0 lock=1 dirfrag=1 caps=0 dirtyrstat=0 dirty=1 waiter=0 authpin=0 0x55e6b5419800] 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.ino(0x10000000000) auth_pin by 0x55e6b54cd440 on [inode 0x10000000000 [...2,head] /subdir/ auth v410 ap=1 f(v0 m2020-02-29T13:29:28.159623+0000 100=100+0) n(v1 rc2020-02-29T13:29:28.160623+0000 b190 101=100+1) (inest lock w=1) (iversion lock w=1) | dirtyscattered=0 request=0 lock=2 dirfrag=1 caps=0 dirtyrstat=0 dirty=1 waiter=0 authpin=1 0x55e6b5419800] now 1 
2020-02-29T13:29:53.496+0000 7f67c0191700 15 mds.0.cache.dir(0x1) adjust_nested_auth_pins 1 on [dir 0x1 / [2,head] auth v=411 cv=409/409 dir_auth=0 ap=0+1 state=1610612737|complete f(v0 m2020-02-29T13:29:28.065624+0000 1=0+1) n(v2 rc2020-02-29T13:29:28.160623+0000 b190 101=100+1) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 subtreetemp=0 dirty=1 waiter=0 authpin=0 0x55e6b529f600] by 0x55e6b5419800 count now 0/1 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.snaprealm(0x3 seq 1 0x55e6b5404280) have_past_parents_open [1,head] 2020-02-29T13:29:53.496+0000 7f67c0191700 15 mds.0.cache.ino(0x10000000000) project_inode 0x10000000000 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.dir(0x1) pre_dirty 412 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.den(0x1 subdir) pre_dirty [dentry #0x1/subdir [2,head] auth (dversion lock) pv=412 v=410 ino=0x10000000000 state=1610612736 | request=0 lock=0 inodepin=1 dirty=1 authpin=0 0x55e6b5416fc0] 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.ino(0x10000000000) pre_dirty 412 (current v 410) 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache predirty_journal_parents frag->inode on [dir 0x10000000000 /subdir/ [2,head] auth pv=359 v=357 cv=357/357 ap=1+2 state=1073741825|complete f(v0 100=100+0)/f(v0 m2020-02-29T13:29:28.159623+0000 100=100+0)->f(v0 100=100+0)/f(v0 m2020-02-29T13:29:28.159623+0000 100=100+0) n(v1 rc2020-02-29T13:29:28.160623+0000 b188 99=99+0)->n(v1 rc2020-02-29T13:29:33.859441+0000 b190 99=99+0)/n(v1 rc2020-02-29T13:29:28.160623+0000 b188 99=99+0) hs=100+0,ss=0+0 | child=1 frozen=0 dirty=0 authpin=1 0x55e6b54b2580] 2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache project_rstat_frag_to_inode [2,head] 2020-02-29T13:29:53.496+0000 7f67c0191700 20 mds.0.cache frag rstat n(v1 rc2020-02-29T13:29:33.859441+0000 b190 99=99+0) 2020-02-29T13:29:53.496+0000 7f67c0191700 20 mds.0.cache frag accounted_rstat n(v1 rc2020-02-29T13:29:28.160623+0000 b188 99=99+0) 
2020-02-29T13:29:53.496+0000 7f67c0191700 20 mds.0.cache delta n(v1 rc2020-02-29T13:29:33.859441+0000 b2) 2020-02-29T13:29:53.496+0000 7f67c0191700 20 mds.0.cache projecting to [2,head] n(v1 rc2020-02-29T13:29:28.160623+0000 b190 101=100+1) 2020-02-29T13:29:53.496+0000 7f67c0191700 20 mds.0.cache result [2,head] n(v1 rc2020-02-29T13:29:33.859441+0000 b192 101=100+1) 2020-02-29T13:29:53.496+0000 7f67c0191700 -1 log_channel(cluster) log [ERR] : unmatched rstat rbytes on single dirfrag 0x10000000000, inode has n(v1 rc2020-02-29T13:29:33.859441+0000 b192 101=100+1), dirfrag has n(v1 rc2020-02-29T13:29:33.859441+0000 b190 99=99+0) 2020-02-29T13:29:53.497+0000 7f67c0191700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-1435-g8e904aa/rpm/el8/BUILD/ceph-15.1.0-1435-g8e904aa/src/mds/MDCache.cc: In function 'void MDCache::predirty_journal_parents(MutationRef, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)' thread 7f67c0191700 time 2020-02-29T13:29:53.497532+0000 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-1435-g8e904aa/rpm/el8/BUILD/ceph-15.1.0-1435-g8e904aa/src/mds/MDCache.cc: 2335: FAILED ceph_assert(!"unmatched rstat rbytes" == g_conf()->mds_verify_scatter) ceph version 15.1.0-1435-g8e904aa (8e904aaab37a2615eec03fd9c9f166a80774d91e) octopus (rc) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f67ce2e30f0] 2: (()+0x27930a) [0x7f67ce2e330a] 3: (MDCache::predirty_journal_parents(boost::intrusive_ptr<MutationImpl>, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x232a) [0x55e6b32d923a] 4: (Locker::check_inode_max_size(CInode*, bool, unsigned long, unsigned long, utime_t)+0x414) [0x55e6b33a9ab4] 5: (RecoveryQueue::_recovered(CInode*, int, unsigned long, utime_t)+0x8bd) [0x55e6b33819dd] 6: 
(MDSContext::complete(int)+0x56) [0x55e6b34c92d6] 7: (MDSIOContextBase::complete(int)+0xa3) [0x55e6b34c94e3] 8: (Filer::C_Probe::finish(int)+0xb5) [0x55e6b3565e75] 9: (Context::complete(int)+0xd) [0x55e6b31f9f5d] 10: (Finisher::finisher_thread_entry()+0x1a5) [0x7f67ce373a05] 11: (()+0x82de) [0x7f67cd10f2de] 12: (clone()+0x43) [0x7f67cbca2133]
From: /ceph/teuthology-archive/pdonnell-2020-02-29_02:56:38-kcephfs-wip-pdonnell-testing-20200229.001503-distro-basic-smithi/4811126/remote/smithi002/log/ceph-mds.a.log.gz
The only PR in this batch related to this code path is: https://github.com/ceph/ceph/pull/33538
I don't see how #33538 would cause this.
The last run of kcephfs on Feb 16th did not have this issue: http://pulpito.ceph.com/pdonnell-2020-02-16_17:35:17-kcephfs-wip-pdonnell-testing-20200215.033325-distro-basic-smithi/
Updated by Zheng Yan about 4 years ago
The test that triggered the assertion is test_fragmented_injection. I think we need to add the following settings:
self.fs.set_ceph_conf('mds', 'mds verify scatter', False)
self.fs.set_ceph_conf('mds', 'mds debug scatterstat', False)
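A side note on why disabling the option avoids the crash: in the failing `ceph_assert(!"unmatched rstat rbytes" == g_conf()->mds_verify_scatter)`, the `!"..."` on a string literal is always false, so the assert fails exactly when `mds_verify_scatter` is true; with it off, the mismatch is only logged to the cluster log. A minimal Python sketch of that behaviour (hypothetical function name, not Ceph code):

```python
def on_unmatched_rstat(mds_verify_scatter: bool) -> str:
    """Sketch of the check at MDCache.cc:2335 (toy model).

    The C++ assert is `ceph_assert(!"unmatched rstat rbytes" ==
    g_conf()->mds_verify_scatter)`; `!"..."` is always false, so the
    assert fails precisely when the mds_verify_scatter option is true.
    """
    # The cluster log [ERR] message is emitted either way.
    action = "log [ERR] unmatched rstat rbytes"
    if mds_verify_scatter == False:   # assert passes: option disabled
        return action + "; continue"
    return action + "; abort (assert failed)"

print(on_unmatched_rstat(True))
print(on_unmatched_rstat(False))
```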
Updated by Xiubo Li about 4 years ago
Zheng Yan wrote:
The test that triggered the assertion is test_fragmented_injection. I think we need to add the following settings:
self.fs.set_ceph_conf('mds', 'mds verify scatter', False)
self.fs.set_ceph_conf('mds', 'mds debug scatterstat', False)
Yeah, I tried them by setting them to True; locally I couldn't reproduce it after running it more than 30 times so far.
Updated by Greg Farnum about 4 years ago
Zheng Yan wrote:
The test that triggered the assertion is test_fragmented_injection. I think we need to add the following settings:
self.fs.set_ceph_conf('mds', 'mds verify scatter', False)
self.fs.set_ceph_conf('mds', 'mds debug scatterstat', False)
That's just disabling the debug checking; I see this test pokes at our metadata but I think the data scan should clean it up. What steps lead to the rstats going wrong?
Updated by Zheng Yan about 4 years ago
Greg Farnum wrote:
That's just disabling the debug checking; I see this test pokes at our metadata but I think the data scan should clean it up. What steps lead to the rstats going wrong?
data scan does not fix the fragstat / rstat. The test uses 'rados rmomapkey' to remove a dentry, which causes the incorrect rstat.
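The invariant behind the assertion can be sketched with a toy model (plain Python, not Ceph code; the real failure sequence also involves a dirfrag merge, see below): a directory inode's rstat must match the sum of its dirfrags' rstats, and removing a dentry out-of-band with `rados rmomapkey` changes one side of that equation without the other. Toy numbers, chosen only for illustration:

```python
from collections import namedtuple

# Toy rstat: recursive bytes / files / subdirs, mirroring "n(... bB T=R+S)".
Rstat = namedtuple("Rstat", "rbytes rfiles rsubdirs")

def frag_rstat(dentries):
    """A dirfrag's rstat is the sum over its dentries' stats."""
    return Rstat(sum(d.rbytes for d in dentries),
                 sum(d.rfiles for d in dentries),
                 sum(d.rsubdirs for d in dentries))

# 100 two-byte regular files, like the test's subdir/ (sizes are toy values).
files = [Rstat(2, 1, 0) for _ in range(100)]
frag = frag_rstat(files)

# The parent inode's rstat also counts the directory itself:
# the "+1" in "101=100+1".
inode = Rstat(frag.rbytes, frag.rfiles, frag.rsubdirs + 1)

# Out-of-band `rados rmomapkey` drops one dentry from the stored dirfrag
# without touching the parent inode's accounting:
frag_after = frag_rstat(files[1:])

# The two sides no longer agree, which is what the MDS later asserts on:
assert (inode.rbytes, inode.rfiles) != (frag_after.rbytes, frag_after.rfiles)
print("inode:", inode, "dirfrag:", frag_after)
```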
Updated by Xiubo Li about 4 years ago
Zheng Yan wrote:
Greg Farnum wrote:
That's just disabling the debug checking; I see this test pokes at our metadata but I think the data scan should clean it up. What steps lead to the rstats going wrong?
data scan does not fix the fragstat / rstat. The test uses 'rados rmomapkey' to remove a dentry, which causes the incorrect rstat.
The "test_fragmented_injection" test case stops all the MDS daemons while doing the "rmomapkey 21_head" for "subdir/21" and the data scan. Later, when the MDS daemons start, all the dentries under "subdir/" will be loaded from the pools, including "subdir/21", so it shouldn't matter that data_scan does not update the rstat/dirstat, right?
I checked the teuthology.log; the test_fragmented_injection test case succeeded:
2020-02-29T13:29:54.230 INFO:teuthology.orchestra.run.smithi002.stderr:2020-02-29T13:29:54.198+0000 7fceda039700 1 -- 172.21.15.2:0/3828554392 shutdown_connections
2020-02-29T13:29:54.230 INFO:teuthology.orchestra.run.smithi002.stderr:2020-02-29T13:29:54.198+0000 7fceda039700 1 -- 172.21.15.2:0/3828554392 wait complete.
2020-02-29T13:29:54.230 INFO:tasks.cephfs_test_runner:test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) ... ok
2020-02-29T13:29:54.231 INFO:tasks.cephfs_test_runner:Starting test: test_parallel_execution (tasks.cephfs.test_data_scan.TestDataScan)
All the dentries under /subdir/ should be regular files, 100 in total, with no directories. But just when taking the wrlock, there is one extra directory and one extra regular file added by someone: we can see "n(v1 rc2020-02-29T13:29:28.160623+0000 b190 101=100+1)" in:
2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache taking wrlock on (inest lock) on [inode 0x10000000000 [...2,head] /subdir/ auth v410 f(v0 m2020-02-29T13:29:28.159623+0000 100=100+0) n(v1 rc2020-02-29T13:29:28.160623+0000 b190 101=100+1) (inest lock) (iversion lock) | dirtyscattered=0 request=0 lock=0 dirfrag=1 caps=0 dirtyrstat=0 dirty=1 waiter=0 authpin=0 0x55e6b5419800]
2020-02-29T13:29:53.496+0000 7f67c0191700 7 mds.0.locker wrlock_force on (inest lock) on [inode 0x10000000000 [...2,head] /subdir/ auth v410 f(v0 m2020-02-29T13:29:28.159623+0000 100=100+0) n(v1 rc2020-02-29T13:29:28.160623+0000 b190 101=100+1) (inest lock) (iversion lock) | dirtyscattered=0 request=0 lock=0 dirfrag=1 caps=0 dirtyrstat=0 dirty=1 waiter=0 authpin=0 0x55e6b5419800]
But just before taking the wrlock, we can see that "/subdir" still has 99 regular files and no directories:
2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache predirty_journal_parents linkunlink=0 primary_dn follows head [inode 0x10000000016 [head,head] /subdir/21 auth v356 pv358 ap=2 s=2 n()->n(v0 rc2020-02-29T13:29:33.859441+0000 b2)/n() (ifile excl->sync) (iversion lock) cr={5509=0-4194304@fffffffffffffffd} | ptrwaiter=0 request=0 lock=0 caps=0 dirtyrstat=1 dirty=0 authpin=1 0x55e6b547a800]
2020-02-29T13:29:53.496+0000 7f67c0191700 10 mds.0.cache.dir(0x10000000000) auth_pin by 0x55e6b54cd440 on [dir 0x10000000000 /subdir/ [2,head] auth v=357 cv=357/357 ap=1+2 state=1073741825|complete f(v0 100=100+0)/f(v0 m2020-02-29T13:29:28.159623+0000 100=100+0) n(v1 rc2020-02-29T13:29:28.160623+0000 b188 99=99+0) hs=100+0,ss=0+0 | child=1 frozen=0 dirty=0 authpin=1 0x55e6b54b2580] count now 1
Including "subdir/21" here, the regular file count is 100.
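For anyone decoding these log entries: `n(vV rc<rctime> bB T=R+S)` is the rstat (version, rctime, recursive bytes, and T total entries = R regular files + S subdirectories), while `f(...)` is the fragstat. A small parser for the common full form (my own helper, not a Ceph API):

```python
import re

def parse_rstat(s):
    """Parse an MDS rstat string like 'n(v1 rc2020-... b190 101=100+1)'.

    Returns (version, rbytes, total, rfiles, rsubdirs). Forms without
    counts, such as the empty rstat 'n()', return the sentinel
    (-1, 0, 0, 0, 0).
    """
    m = re.search(r"n\(v(\d+)[^)]*?(?:\sb(\d+))?\s(\d+)=(\d+)\+(\d+)\)", s)
    if not m:
        return (-1, 0, 0, 0, 0)
    v, b, total, files, dirs = m.groups()
    return (int(v), int(b or 0), int(total), int(files), int(dirs))

print(parse_rstat("n(v1 rc2020-02-29T13:29:28.160623+0000 b190 101=100+1)"))
```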
After the test_fragmented_injection test case, the "subdir/*" will be kept, and it could be used by other test cases.
BTW, in which case will the recovery queue be triggered? I didn't see any file recovery test case being run.
BRs
Updated by Zheng Yan about 4 years ago
Xiubo Li wrote:
Zheng Yan wrote:
Greg Farnum wrote:
That's just disabling the debug checking; I see this test pokes at our metadata but I think the data scan should clean it up. What steps lead to the rstats going wrong?
rstat can go wrong if the MDS merges dirfrags automatically before the data scan
data scan does not fix the fragstat / rstat. The test uses 'rados rmomapkey' to remove a dentry, which causes the incorrect rstat.
The "test_fragmented_injection" test case stops all the MDS daemons while doing the "rmomapkey 21_head" for "subdir/21" and the data scan. Later, when the MDS daemons start, all the dentries under "subdir/" will be loaded from the pools, including "subdir/21", so it shouldn't matter that data_scan does not update the rstat/dirstat, right?
I checked the teuthology.log; the test_fragmented_injection test case succeeded:
[...]
All the dentries under /subdir/ should be regular files, 100 in total, with no directories. But just when taking the wrlock, there is one extra directory and one extra regular file added by someone: we can see "n(v1 rc2020-02-29T13:29:28.160623+0000 b190 101=100+1)" in:
the one dir in the rstat is the directory itself
[...]
But just before taking the wrlock, we can see that "/subdir" still has 99 regular files and no directories:
[...]
Including "subdir/21" here, the regular file count is 100.
After the test_fragmented_injection test case, the "subdir/*" will be kept, and it could be used by other test cases.
BTW, in which case will the recovery queue be triggered? I didn't see any file recovery test case being run.
BRs
Updated by Xiubo Li about 4 years ago
I can reproduce it 100% locally; the steps are:
1) run the test_fragmented_injection test case
2) manually remove subdir/ from the mountpoint
441 2020-04-08T01:06:41.005-0400 7f00ebd3d700 10 mds.0.server _unlink_local [dentry #0x1/subdir/21 [head,head] auth (dn xlock x=1 by 0x557c45093 000) (dversion lock w=1 last_client=35588) pv=0 v=303 ap=2 ino=0x1000000020a state=1073741824 | request=1 lock=2 fragmenting=0 inodepin=1 au thpin=1 0x557c450ade00]
442 2020-04-08T01:06:41.005-0400 7f00ebd3d700 10 mds.0.cache.dir(0x10000000000) pre_dirty 305
443 2020-04-08T01:06:41.005-0400 7f00ebd3d700 10 mds.0.cache.den(0x10000000000 21) pre_dirty [dentry #0x1/subdir/21 [head,head] auth (dn xlock x =1 by 0x557c45093000) (dversion lock w=1 last_client=35588) pv=305 v=303 ap=2 ino=0x1000000020a state=1073741824 | request=1 lock=2 fragment ing=0 inodepin=1 authpin=1 0x557c450ade00]
444 2020-04-08T01:06:41.005-0400 7f00ebd3d700 15 mds.0.cache.ino(0x1000000020a) project_inode 0x1000000020a
445 2020-04-08T01:06:41.005-0400 7f00ebd3d700 10 mds.0.cache.dir(0x601) pre_dirty 3
446 2020-04-08T01:06:41.005-0400 7f00ebd3d700 10 mds.0.cache.den(0x601 1000000020a) pre_dirty [dentry #0x100/stray1/1000000020a [2,head] auth NU LL (dn xlock x=1 by 0x557c45093000) (dversion lock w=1 last_client=35588) pv=3 v=2 ap=2 ino=(nil) state=1342177345|new|bottomlru | request=1 lock=2 authpin=1 0x557c450c8600]
447 2020-04-08T01:06:41.005-0400 7f00ebd3d700 10 mds.0.cache.ino(0x1000000020a) pre_dirty 3 (current v 1)
448 2020-04-08T01:06:41.005-0400 7f00ebd3d700 10 mds.0.cache predirty_journal_parents do_parent_mtime linkunlink=-1 primary_dn follows head [ino de 0x1000000020a [head,head] ~mds0/stray1/1000000020a auth v1 pv3 ap=3 s=2 n()->n(v0 rc2020-04-08T01:06:40.995200-0400)/n() (ilink xlock x=1 by 0x557c45093000) (iversion lock w=1 last_client=35588) caps={35588=pAsXsFscr/-@3} | request=1 lock=3 caps=1 authpin=1 0x557c450b5c00]
449 2020-04-08T01:06:41.005-0400 7f00ebd3d700 10 mds.0.cache.dir(0x10000000000) auth_pin by 0x557c45093000 on [dir 0x10000000000 /subdir/ [2,hea d] auth v=304 cv=304/304 ap=2+5 state=1074003969|complete f(v0 100=100+0)/f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0) n(v0 rc2020-04-07 T20:35:35.448479-0400 b188 99=99+0) hs=100+0,ss=0+0 | child=1 frozen=0 dirty=0 authpin=1 0x557c450cc000] count now 2
450 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache.dir(0x10000000000) project_fnode 0x557c45058410
451 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache.dir(0x10000000000) pre_dirty 306
452 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache predirty_journal_parents bumping change_attr to 1 on 0x557c450cc000
453 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache predirty_journal_parents updating mtime on [dir 0x10000000000 /subdir/ [2,head] aut h pv=306 v=304 cv=304/304 ap=2+5 state=1074003969|complete f(v0 100=100+0)/f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0)->f(v0 m2020-04-0 8T01:06:40.995200-0400 100=100+0)/f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0) n(v0 rc2020-04-07T20:35:35.448479-0400 b188 99=99+0)->n(v 0 rc2020-04-07T20:35:35.448479-0400 b188 99=99+0) hs=100+0,ss=0+0 | child=1 frozen=0 dirty=0 authpin=1 0x557c450cc000]
454 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache predirty_journal_parents updating size on [dir 0x10000000000 /subdir/ [2,head] auth pv=306 v=304 cv=304/304 ap=2+5 state=1074003969|complete f(v0 100=100+0)/f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0)->f(v0 m2020-04-08 T01:06:40.995200-0400 100=100+0)/f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0) n(v0 rc2020-04-07T20:35:35.448479-0400 b188 99=99+0)->n(v0 rc2020-04-08T01:06:40.995200-0400 b188 99=99+0)/n(v0 rc2020-04-07T20:35:35.448479-0400 b188 99=99+0) hs=100+0,ss=0+0 | child=1 frozen=0 dir ty=0 authpin=1 0x557c450cc000]
455 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache.snaprealm(0x1 seq 1 0x557c45023b80) have_past_parents_open [1,head]
456 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache projected_rstat_inode_to_frag first head linkunlink -1 [inode 0x1000000020a [head,h ead] ~mds0/stray1/1000000020a auth v1 pv3 ap=3 s=2 n()->n(v0 rc2020-04-08T01:06:40.995200-0400)/n() (ilink xlock x=1 by 0x557c45093000) (ive rsion lock w=1 last_client=35588) caps={35588=pAsXsFscr/-@3} | request=1 lock=3 caps=1 authpin=1 0x557c450b5c00]
457 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache frag head is [2,head]
458 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache inode update is [head,head]
459 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache floor of 2 from parent dn [dentry #0x100/stray1/1000000020a [2,head] auth NULL (dn xlock x=1 by 0x557c45093000) (dversion lock w=1 last_client=35588) pv=3 v=2 ap=2 ino=(nil) state=1342177345|new|bottomlru | request=1 lock= 2 authpin=1 0x557c450c8600]
460 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache.snaprealm(0x1 seq 1 0x557c45023b80) have_past_parents_open [1,head]
461 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache.snaprealm(0x1 seq 1 0x557c45023b80) get_snaps (seq 1 cached_seq 1)
462 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache _project_rstat_inode_to_frag [head,head]
463 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache inode rstat n(v0 rc2020-04-08T01:06:40.995200-0400)
464 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache inode accounted_rstat n()
465 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache delta n()
466 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache projecting to head [2,head] n(v0 rc2020-04-08T01:06:40.995200-0400 b188 99=99+0)
467 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache project to [2,head] n(v0 rc2020-04-08T01:06:40.995200-0400 b188 99=99+0)
468 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache result [2,head] n(v0 rc2020-04-08T01:06:40.995200-0400 b188 99=99+0) [dir 0x1 0000000000 /subdir/ [2,head] auth pv=306 v=304 cv=304/304 ap=2+5 state=1074003969|complete f(v0 100=100+0)/f(v0 m2020-04-07T20:35:35.443058- 0400 100=100+0)->f(v0 m2020-04-08T01:06:40.995200-0400 99=99+0)/f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0) n(v0 rc2020-04-07T20:35:35. 448479-0400 b188 99=99+0)->n(v0 rc2020-04-08T01:06:40.995200-0400 b188 99=99+0)/n(v0 rc2020-04-07T20:35:35.448479-0400 b188 99=99+0) hs=100+ 0,ss=0+0 | child=1 frozen=0 dirty=0 authpin=1 0x557c450cc000]
469 2020-04-08T01:06:41.006-0400 7f00ebd3d700 7 mds.0.locker local_wrlock_grab on (iversion lock) on [inode 0x10000000000 [...2,head] /subdir/ auth v407 ap=2 f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0) n(v0 rc2020-04-07T20:35:35.448479-0400 b190 101=100+1) (isnap sync r=1) (in est lock w=1) (ifile lock w=1) (iversion lock) caps={35588=pAsLsXs/-@2},l=35588 | request=0 lock=3 dirfrag=1 caps=1 dirty=0 waiter=0 authpin =1 0x557c4504fc00]
470 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache.snaprealm(0x3 seq 1 0x557c45023900) have_past_parents_open [1,head]
471 2020-04-08T01:06:41.006-0400 7f00ebd3d700 15 mds.0.cache.ino(0x10000000000) project_inode 0x10000000000
472 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache.dir(0x1) pre_dirty 408
473 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache.den(0x1 subdir) pre_dirty [dentry #0x1/subdir [2,head] auth (dversion lock) pv=408 v=407 ino=0x10000000000 state=1073741824 | request=0 lock=0 inodepin=1 dirty=0 authpin=0 0x557c45044600]
474 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache.ino(0x10000000000) pre_dirty 408 (current v 407)
475 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache predirty_journal_parents add_delta f(v0 m2020-04-08T01:06:40.995200-0400 99=99+0)
476 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache predirty_journal_parents - f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0)
477 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache predirty_journal_parents gives f(v0 m2020-04-08T01:06:40.995200-0400 99=99+0) o n [inode 0x10000000000 [...2,head] /subdir/ auth v407 pv408 ap=2 f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0)->f(v0 m2020-04-08T01:06:40 .995200-0400 99=99+0) n(v0 rc2020-04-07T20:35:35.448479-0400 b190 101=100+1)->n(v0 rc2020-04-07T20:35:35.448479-0400 b190 101=100+1) (isnap sync r=1) (inest lock w=1) (ifile lock w=1) (iversion lock w=1 last_client=35588) caps={35588=pAsLsXs/-@2},l=35588 | request=0 lock=4 dirfra g=1 caps=1 dirty=0 waiter=0 authpin=1 0x557c4504fc00]
478 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache predirty_journal_parents frag->inode on [dir 0x10000000000 /subdir/ [2,head] auth p v=306 v=304 cv=304/304 ap=2+5 state=1074003969|complete f(v0 100=100+0)/f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0)->f(v0 m2020-04-08T0 1:06:40.995200-0400 99=99+0) n(v0 rc2020-04-07T20:35:35.448479-0400 b188 99=99+0)->n(v0 rc2020-04-08T01:06:40.995200-0400 b188 99=99+0)/n(v0 rc2020-04-07T20:35:35.448479-0400 b188 99=99+0) hs=100+0,ss=0+0 | child=1 frozen=0 dirty=0 authpin=1 0x557c450cc000]
479 2020-04-08T01:06:41.006-0400 7f00ebd3d700 10 mds.0.cache project_rstat_frag_to_inode [2,head]
480 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache frag rstat n(v0 rc2020-04-08T01:06:40.995200-0400 b188 99=99+0)
481 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache frag accounted_rstat n(v0 rc2020-04-07T20:35:35.448479-0400 b188 99=99+0)
482 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache delta n(v0 rc2020-04-08T01:06:40.995200-0400)
483 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache projecting to [2,head] n(v0 rc2020-04-07T20:35:35.448479-0400 b190 101=100+1)
484 2020-04-08T01:06:41.006-0400 7f00ebd3d700 20 mds.0.cache result [2,head] n(v0 rc2020-04-08T01:06:40.995200-0400 b190 101=100+1)
485 2020-04-08T01:06:41.007-0400 7f00ebd3d700 -1 log_channel(cluster) log [ERR] : unmatched rstat rbytes on single dirfrag 0x10000000000, inode has n(v0 rc2020-04-08T01:06:40.995200-0400 b190 101=100+1), dirfrag has n(v0 rc2020-04-08T01:06:40.995200-0400 b188 99=99+0)
486 2020-04-08T01:06:41.031-0400 7f00ebd3d700 -1 /data/ceph/src/mds/MDCache.cc: In function 'void MDCache::predirty_journal_parents(MutationRef, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)' thread 7f00ebd3d700 time 2020-04-08T01:06:41.008923-0400
487 /data/ceph/src/mds/MDCache.cc: 2335: FAILED ceph_assert(!"unmatched rstat rbytes" == g_conf()->mds_verify_scatter)
488
ceph version 15.2.0-328-ge7d98c3 (e7d98c37a96c82f76e51ca97ffc39657178ebce7) octopus (rc)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x7f00f4a4b56c]
2: (()+0x14807ee) [0x7f00f4a4b7ee]
3: (MDCache::predirty_journal_parents(boost::intrusive_ptr<MutationImpl>, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x24f0) [0x557c41adc0f6]
4: (Server::_unlink_local(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*)+0x73f) [0x557c41a1092b]
5: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0x15b0) [0x557c41a1001a]
6: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0x129c) [0x557c419e91cc]
7: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x15ac) [0x557c419e7720]
8: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x99f) [0x557c419cf575]
9: (MDSRank::handle_deferrable_message(boost::intrusive_ptr<Message const> const&)+0x814) [0x557c4195e864]
10: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x387) [0x557c4195d319]
11: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0xdd) [0x557c4195cea7]
12: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1f3) [0x557c4193c5f3]
13: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0xe9) [0x7f00f4c145b1]
14: (DispatchQueue::entry()+0x5fc) [0x7f00f4c13284]
15: (DispatchQueue::DispatchThread::entry()+0x1c) [0x7f00f4d6c9ae]
16: (Thread::entry_wrapper()+0x78) [0x7f00f49e91b0]
17: (Thread::_entry_func(void*)+0x18) [0x7f00f49e912e]
18: (()+0x7e65) [0x7f00f2705e65]
19: (clone()+0x6d) [0x7f00f13b388d]

2020-04-08T01:06:41.057-0400 7f00ebd3d700 -1 *** Caught signal (Aborted) **
in thread 7f00ebd3d700 thread_name:ms_dispatch

ceph version 15.2.0-328-ge7d98c3 (e7d98c37a96c82f76e51ca97ffc39657178ebce7) octopus (rc)
1: (()+0xd8d80c) [0x557c41ed880c]
2: (()+0xf5f0) [0x7f00f270d5f0]
3: (gsignal()+0x37) [0x7f00f12eb337]
4: (abort()+0x148) [0x7f00f12eca28]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x379) [0x7f00f4a4b73b]
6: (()+0x14807ee) [0x7f00f4a4b7ee]
7: (MDCache::predirty_journal_parents(boost::intrusive_ptr<MutationImpl>, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x24f0) [0x557c41adc0f6]
8: (Server::_unlink_local(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*)+0x73f) [0x557c41a1092b]
9: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0x15b0) [0x557c41a1001a]
10: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0x129c) [0x557c419e91cc]
11: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x15ac) [0x557c419e7720]
12: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x99f) [0x557c419cf575]
13: (MDSRank::handle_deferrable_message(boost::intrusive_ptr<Message const> const&)+0x814) [0x557c4195e864]
14: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x387) [0x557c4195d319]
15: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0xdd) [0x557c4195cea7]
16: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1f3) [0x557c4193c5f3]
17: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0xe9) [0x7f00f4c145b1]
18: (DispatchQueue::entry()+0x5fc) [0x7f00f4c13284]
19: (DispatchQueue::DispatchThread::entry()+0x1c) [0x7f00f4d6c9ae]
20: (Thread::entry_wrapper()+0x78) [0x7f00f49e91b0]
21: (Thread::_entry_func(void*)+0x18) [0x7f00f49e912e]
22: (()+0x7e65) [0x7f00f2705e65]
23: (clone()+0x6d) [0x7f00f13b388d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
The problem is that after the dirfrag merge is done, the dirfrag rstat is incorrect: it is "b188 99=99+0", but "b190 100=100+0" is expected:
2020-04-07T20:36:35.649-0400 7f00e5530700 10 MDSContext::complete: 18C_MDC_FragmentPrep
2020-04-07T20:36:35.649-0400 7f00e5530700 10 mds.0.cache fragment_logged 0x10000000000 bits -1 on [inode 0x10000000000 [...2,head] /subdir/ auth v403 pv407 ap=2 f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0)->f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0) n(v0 rc2020-04-07T20:35:35.448479-0400 b190 101=100+1)->n(v0 rc2020-04-07T20:35:35.448479-0400 b190 101=100+1) (idft lock w=1) (inest lock w=1) (ifile excl w=1) (iversion lock) caps={35564=pAsLsXsFs/-@5},l=35564 | request=0 lock=3 dirfrag=1 caps=1 waiter=0 authpin=1 0x557c4504fc00]
2020-04-07T20:36:35.649-0400 7f00e5530700 15 mds.0.cache.ino(0x10000000000) pop_and_dirty_projected_inode 0x10000000000 v407
2020-04-07T20:36:35.649-0400 7f00e5530700 10 mds.0.cache.ino(0x10000000000) mark_dirty [inode 0x10000000000 [...2,head] /subdir/ auth v403 pv407 ap=2 f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0)->f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0) n(v0 rc2020-04-07T20:35:35.448479-0400 b190 101=100+1)->n(v0 rc2020-04-07T20:35:35.448479-0400 b190 101=100+1) (idft lock w=1) (inest lock w=1) (ifile excl w=1) (iversion lock) caps={35564=pAsLsXsFs/-@5},l=35564 | request=0 lock=3 dirfrag=1 caps=1 waiter=0 authpin=1 0x557c4504fc00]
2020-04-07T20:36:35.649-0400 7f00e5530700 10 mds.0.cache.den(0x1 subdir) mark_dirty [dentry #0x1/subdir [2,head] auth (dversion lock) pv=407 v=405 ino=0x10000000000 state=1073741824 | request=0 lock=0 inodepin=1 authpin=0 0x557c45044600]
2020-04-07T20:36:35.649-0400 7f00e5530700 10 mds.0.cache.dir(0x1) _mark_dirty (already dirty) [dir 0x1 / [2,head] auth v=407 cv=0/0 dir_auth=0 ap=0+2 state=1610874881|complete f(v0 m2020-04-07T20:35:34.406134-0400 1=0+1) n(v1 rc2020-04-07T20:35:35.448479-0400 b190 101=100+1) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 subtreetemp=0 dirty=1 waiter=0 authpin=0 0x557c4430b700] version 407
2020-04-07T20:36:35.649-0400 7f00e5530700 10 mds.0.cache storing result frag [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 state=1610874889|complete|frozendir f(v0 100=100+0)/f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0) n(v0 rc2020-04-07T20:35:35.448479-0400 b188 99=99+0) hs=100+0,ss=0+0 | child=1 frozen=1 dirty=1 0x557c450cc000]
2020-04-07T20:36:35.649-0400 7f00e5530700 10 mds.0.cache.dir(0x10000000000) auth_pin by 0x557c44f8d200 on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 ap=1+0 state=1610874889|complete|frozendir f(v0 100=100+0)/f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0) n(v0 rc2020-04-07T20:35:35.448479-0400 b188 99=99+0) hs=100+0,ss=0+0 | child=1 frozen=1 dirty=1 authpin=1 0x557c450cc000] count now 1
2020-04-07T20:36:35.649-0400 7f00e5530700 10 mds.0.cache.dir(0x10000000000) commit want 0 on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 ap=1+0 state=1610878985|complete|frozendir|fragmenting f(v0 100=100+0)/f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0) n(v0 rc2020-04-07T20:35:35.448479-0400 b188 99=99+0) hs=100+0,ss=0+0 | child=1 frozen=1 dirty=1 authpin=1 0x557c450cc000]
2020-04-07T20:36:35.649-0400 7f00e5530700 10 mds.0.cache.dir(0x10000000000) auth_pin by 0x557c450cc000 on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 ap=2+0 state=1610878985|complete|frozendir|fragmenting f(v0 100=100+0)/f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0) n(v0 rc2020-04-07T20:35:35.448479-0400 b188 99=99+0) hs=100+0,ss=0+0 | child=1 frozen=1 dirty=1 authpin=1 0x557c450cc000] count now 2
2020-04-07T20:36:35.650-0400 7f00e5530700 10 mds.0.cache.dir(0x10000000000) _commit want 304 on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 ap=2+0 state=1610878985|complete|frozendir|fragmenting f(v0 100=100+0)/f(v0 m2020-04-07T20:35:35.443058-0400 100=100+0) n(v0 rc2020-04-07T20:35:35.448479-0400 b188 99=99+0) hs=100+0,ss=0+0 | child=1 frozen=1 dirty=1 authpin=1 0x557c450cc000]
2020-04-07T20:36:35.650-0400 7f00e5530700 10 mds.0.cache.dir(0x10000000000) marking committing
2020-04-07T20:36:35.650-0400 7f00e5530700 10 mds.0.cache.dir(0x10000000000) _omap_commit
So after the dirfrag merge, regardless of which code path is taken, the MDS daemon will crash as soon as predirty_journal_parents() is called.
Updated by Xiubo Li about 4 years ago
The root cause is that the cephfs-data-scan tool did not update InodeStore->inode->rstat, so when the MDS daemon loads the dentry for "subdir/21" its pi->rstat is empty, and "subdir/" never gets a chance to count the size and file number of "subdir/21" into its own rstat.
Updated by Xiubo Li about 4 years ago
The fixing PR: https://github.com/ceph/ceph/pull/34410
It adds accounted_rstat/rstat when building a file dentry in the data_scan tool, and sets dnfirst=2 by default, which also fixes another MDS daemon crash.
Updated by Xiubo Li about 4 years ago
- Status changed from In Progress to Fix Under Review
Updated by Greg Farnum about 4 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to octopus
Updated by Nathan Cutler about 4 years ago
- Copied to Backport #45220: octopus: mds: MDCache.cc: 2335: FAILED ceph_assert(!"unmatched rstat rbytes" == g_conf()->mds_verify_scatter) added
Updated by Patrick Donnelly about 4 years ago
- Backport changed from octopus to octopus,nautilus
Updated by Nathan Cutler about 4 years ago
- Copied to Backport #45497: nautilus: mds: MDCache.cc: 2335: FAILED ceph_assert(!"unmatched rstat rbytes" == g_conf()->mds_verify_scatter) added
Updated by Nathan Cutler almost 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".