Bug #20494

cephfs_data_scan: try_remove_dentries_for_stray assertion failure

Added by Ivan Guan almost 7 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
High
Assignee:
Category:
fsck/damage handling
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
tools
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Using teuthology I ran data-scan.yaml, and after the test_data_scan.py:test_parallel_execution test case completed I wanted to delete all dentries under my mountpoint. The MDS somehow dumps core when I run "rm -rf *" under my mountpoint.

The stack is:
#0 0x00007f16a6b3bfcb in raise () from /lib64/libpthread.so.0
#1 0x00007f16a7c610d5 in reraise_fatal (signum=6) at global/signal_handler.cc:71
#2 handle_fatal_signal (signum=6) at global/signal_handler.cc:133
#3 <signal handler called>
#4 0x00007f16a55405f7 in raise () from /lib64/libc.so.6
#5 0x00007f16a5541ce8 in abort () from /lib64/libc.so.6
#6 0x00007f16a7d5b0c7 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7f16a7eead67 "dn->get_linkage()->is_null()",
file=file@entry=0x7f16a7f00492 "mds/CDir.cc", line=line@entry=697,
func=func@entry=0x7f16a7f02d40 <CDir::try_remove_dentries_for_stray()::__PRETTY_FUNCTION__> "void CDir::try_remove_dentries_for_stray()") at common/assert.cc:78
#7 0x00007f16a7b30860 in CDir::try_remove_dentries_for_stray (this=0x7f16b2904400) at mds/CDir.cc:697
#8 0x00007f16a7ab9b49 in StrayManager::__eval_stray (this=0x7f16b28c0b28, dn=dn@entry=0x7f16b2953c20, delay=<optimized out>) at mds/StrayManager.cc:575
#9 0x00007f16a7ab9fee in StrayManager::eval_stray (this=<optimized out>, dn=0x7f16b2953c20, delay=<optimized out>) at mds/StrayManager.cc:656
#10 0x00007f16a7a05e61 in put (by=-1003, this=0x7f16b2953c20) at mds/mdstypes.h:1521
#11 MutationImpl::drop_pins (this=0x7f16b2ab2d00) at mds/Mutation.cc:53
#12 0x00007f16a7a2bf28 in MDCache::request_cleanup (this=this@entry=0x7f16b28c0000, mdr=std::shared_ptr (count 3, weak 0) 0x7f16b2ab2d00) at mds/MDCache.cc:9003
#13 0x00007f16a7a2c3c1 in MDCache::request_finish (this=0x7f16b28c0000, mdr=std::shared_ptr (count 3, weak 0) 0x7f16b2ab2d00) at mds/MDCache.cc:8853
#14 0x00007f16a79b16a8 in Server::reply_client_request (this=this@entry=0x7f16b27f59d0, mdr=std::shared_ptr (count 3, weak 0) 0x7f16b2ab2d00,
reply=reply@entry=0x7f16b2b84b00) at mds/Server.cc:1210
#15 0x00007f16a79b22a1 in Server::respond_to_request (this=this@entry=0x7f16b27f59d0, mdr=std::shared_ptr (count 3, weak 0) 0x7f16b2ab2d00, r=r@entry=0)
at mds/Server.cc:1040
#16 0x00007f16a79bd4df in Server::_unlink_local_finish (this=0x7f16b27f59d0, mdr=std::shared_ptr (count 3, weak 0) 0x7f16b2ab2d00, dn=0x7f16b2952000,
straydn=0x7f16b2953c20, dnpv=4) at mds/Server.cc:5666
#17 0x00007f16a7bb054b in complete (r=0, this=0x7f16b27c1440) at include/Context.h:64
#18 MDSInternalContextBase::complete (this=0x7f16b27c1440, r=0) at mds/MDSContext.cc:30
#19 0x00007f16a7bb054b in complete (r=0, this=0x7f16b2af8c60) at include/Context.h:64
#20 MDSInternalContextBase::complete (this=0x7f16b2af8c60, r=0) at mds/MDSContext.cc:30
#21 0x00007f16a7bc7463 in C_MDL_Flushed::finish (this=0x7f16b291bd20, r=<optimized out>) at mds/MDLog.cc:350
#22 0x00007f16a7bb0874 in complete (r=0, this=0x7f16b291bd20) at include/Context.h:64
#23 MDSIOContextBase::complete (this=0x7f16b291bd20, r=0) at mds/MDSContext.cc:65
#24 0x00007f16a7c8a856 in Finisher::finisher_thread_entry (this=0x7f16b28b26e0) at common/Finisher.cc:68
#25 0x00007f16a6b34dc5 in start_thread () from /lib64/libpthread.so.0
#26 0x00007f16a560128d in clone () from /lib64/libc.so.6

ceph_version:jewel ceph-v10.2.2

Bug Description:
My mountpoint has one directory containing 25 files, like:
/subdir/0
/subdir/1
/subdir/2
...
/subdir/24

The root cause is that the subdir's fnode.nfiles is 1 after cephfs-data-scan completes. As everyone knows, all files are deleted before their parent directory 'subdir' is deleted. The key point is that after the first file is deleted, subdir's fnode.nfiles drops to 0, so the 'subdir' directory is considered empty and eligible for deletion even though it still contains dentries, which leads to the MDS core dump.
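
To make the failure mode concrete, here is a small standalone C++ toy model (invented types, not Ceph code) of how an under-counted fragstat lets the MDS treat a directory that still holds real dentries as empty, tripping an assertion of the same shape as the one in CDir::try_remove_dentries_for_stray() above:

// Toy model only: not Ceph code, names invented for illustration.
#include <cassert>
#include <map>
#include <string>

struct Dentry { bool null_linkage; };

struct DirFrag {
  long nfiles = 0;                      // recovered fragstat counter (may be wrong)
  std::map<std::string, Dentry> items;  // the actual dentries stored in the omap

  bool looks_empty() const { return nfiles <= 0; }

  void try_remove_dentries_for_stray() {
    // The purge path assumes only null dentries remain in an "empty" stray dir.
    for (auto &p : items)
      assert(p.second.null_linkage);    // fires when the fragstat under-counted
    items.clear();
  }
};

int main() {
  DirFrag subdir;
  // cephfs-data-scan re-created all 25 dentries but recovered nfiles as 1.
  for (int i = 0; i < 25; ++i)
    subdir.items["file" + std::to_string(i)] = Dentry{false};
  subdir.nfiles = 1;

  // Unlinking the first file drops the counter to 0 while 24 real dentries remain.
  subdir.items["file0"].null_linkage = true;
  --subdir.nfiles;

  if (subdir.looks_empty())
    subdir.try_remove_dentries_for_stray();  // assertion failure, as in the backtrace
  return 0;
}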

Solutions:
First:
We can decrement dir->fnode.nfiles when doing 'rados rmomapkey' and increment it correspondingly when doing cephfs-data-scan.

Second:
We don't need to decrement dir->fnode.nfiles when doing 'rados rmomapkey', but we do have to increment it when the file's parent directory's omap_header does not exist and we create it manually during cephfs-data-scan.

Actions #1

Updated by Ivan Guan almost 7 years ago

After careful consideration, I am trying to fix this bug using the first solution.
The function inject_with_backtrace will traverse an object's ancestors.
When we inject_linkage one dentry, we should increment the parent dir's fnode.fragstat.nfiles if we didn't create the parent directory's omap_header in the first cycle. If it is not the first cycle, we should increment the parent dir's fnode.fragstat.nsubdirs after inject_linkage. In order for all hierarchical directories to have an up-to-date fragstat, we should traverse all ancestors of the object. Do you think so?
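
A rough standalone sketch of the bookkeeping being proposed (all names such as FragstatSketch and inject_linkage_and_count are invented for illustration and are not the real cephfs-data-scan/MetadataDriver API): after each dentry injection, bump the parent fragment's counters according to whether the entry injected in this cycle is the file itself or an ancestor directory, then write the updated fnode back to the omap header.

// Illustrative only; models the proposed fragstat bookkeeping during injection.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

struct FragstatSketch {
  uint64_t nfiles = 0;    // stands in for fnode.fragstat.nfiles
  uint64_t nsubdirs = 0;  // stands in for fnode.fragstat.nsubdirs
};

// parent dirfrag id -> recovered counters (stand-in for the omap header fnode)
std::map<std::string, FragstatSketch> recovered_fnodes;

// Called once per cycle while walking an object's backtrace upward.
// first_cycle == true  : injecting the file's own dentry      -> count a file
// first_cycle == false : injecting an ancestor directory entry -> count a subdir
void inject_linkage_and_count(const std::string &parent_dirfrag,
                              const std::string &dname, bool first_cycle) {
  // ... real code would write the dentry key into the dirfrag object's omap ...
  FragstatSketch &f = recovered_fnodes[parent_dirfrag];
  if (first_cycle)
    ++f.nfiles;
  else
    ++f.nsubdirs;
  // ... real code would re-encode the fnode and rewrite the omap header ...
  std::cout << parent_dirfrag << ": nfiles=" << f.nfiles
            << " nsubdirs=" << f.nsubdirs
            << " (after injecting " << dname << ")\n";
}

int main() {
  // Recovering /subdir/0 .. /subdir/24 as in the bug description.
  for (int i = 0; i < 25; ++i)
    inject_linkage_and_count("subdir-frag", std::to_string(i), /*first_cycle=*/true);
  // Each file's backtrace walk also reaches "subdir" under the root fragment.
  // In the real tool this increment must be guarded (e.g. only when the dentry
  // was newly injected or the omap_header had to be created), otherwise every
  // file's walk would double-count 'subdir'.
  inject_linkage_and_count("root-frag", "subdir", /*first_cycle=*/false);
  return 0;
}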

Actions #2

Updated by Zheng Yan almost 7 years ago

Is this inconsistent state created by cephfs-data-scan? I think we can avoid injecting the dentry into the stray directory (inject it into lost+found instead). MetadataDriver::inject_with_backtrace() calls MetadataDriver::get_frag_of(), and MetadataDriver::get_frag_of() loads the directory inode; it's easy to check whether that inode is a stray there.
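
A minimal sketch of the check being suggested, under the assumption that stray directory inodes occupy a small reserved range well below the first client inode; the range constants and the redirect decision below are illustrative stand-ins, not the real MetadataDriver::get_frag_of() code (in the tree the check would use the MDS_INO_IS_STRAY() helper from mds/mdstypes.h):

// Illustrative only; constants are assumptions for this standalone sketch.
#include <cstdint>
#include <iostream>

constexpr uint64_t kAssumedStrayLo = 0x600;   // assumed start of stray dir inos
constexpr uint64_t kAssumedStrayHi = 0x1000;  // assumed end of stray dir inos

bool looks_like_stray_dir(uint64_t ino) {
  return ino >= kAssumedStrayLo && ino < kAssumedStrayHi;
}

// Stand-in for the decision inside the backtrace walk: if the parent directory
// recovered from an object's backtrace turns out to be a stray dir, don't
// re-inject the dentry there; put it in lost+found instead.
const char *injection_target(uint64_t parent_ino) {
  return looks_like_stray_dir(parent_ino) ? "lost+found" : "parent dirfrag";
}

int main() {
  std::cout << "parent 0x10000000001 -> " << injection_target(0x10000000001ULL) << "\n";
  std::cout << "parent 0x601         -> " << injection_target(0x601ULL) << "\n";
  return 0;
}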

Actions #3

Updated by Ivan Guan almost 7 years ago

Ivan Guan wrote:

After careful consideration, I am trying to fix this bug using the first solution.
The function inject_with_backtrace will traverse an object's ancestors.
When we inject_linkage one dentry, we should increment the parent dir's fnode.fragstat.nfiles if we didn't create the parent directory's omap_header in the first cycle. If it is not the first cycle, we should increment the parent dir's fnode.fragstat.nsubdirs after inject_linkage. In order for all hierarchical directories to have an up-to-date fragstat, we should traverse all ancestors of the object. Do you think so?

My solution doesn't change whether the dentry should be injected into the stray directory or lost+found. I only want to update the parent directory's omap_header after we call inject_linkage. If we inject a file, nfiles should be increased correspondingly, but if we inject a directory, we should increase nsubdirs. In my testing, doing this ensures all hierarchical directories have the right nfiles and nsubdirs statistics.

Actions #4

Updated by Zheng Yan almost 7 years ago

Ivan Guan wrote:

Ivan Guan wrote:

After careful consideration, I am trying to fix this bug using the first solution.
The function inject_with_backtrace will traverse an object's ancestors.
When we inject_linkage one dentry, we should increment the parent dir's fnode.fragstat.nfiles if we didn't create the parent directory's omap_header in the first cycle. If it is not the first cycle, we should increment the parent dir's fnode.fragstat.nsubdirs after inject_linkage. In order for all hierarchical directories to have an up-to-date fragstat, we should traverse all ancestors of the object. Do you think so?

My solution doesn't change whether the dentry should be injected into the stray directory or lost+found. I only want to update the parent directory's omap_header after we call inject_linkage. If we inject a file, nfiles should be increased correspondingly, but if we inject a directory, we should increase nsubdirs. In my testing, doing this ensures all hierarchical directories have the right nfiles and nsubdirs statistics.

This is better, but it still can't ensure fragstats/rstats are correct. Nothing can be 100% trusted during fs recovery. Running "ceph daemon mds.a scrub_path / force recursive repair" can fix wrong fragstat/rstat.

Actions #5

Updated by Zheng Yan almost 7 years ago

I fixed a cephfs-data-scan bug. https://github.com/ceph/ceph/pull/16202. It can explain this inconsistent state.

Actions #6

Updated by Ivan Guan almost 7 years ago

Thank you. I tested running "ceph daemon mds.a scrub_path / force recursive repair" after scan_inodes. As you said, all hierarchical directories then have the right fragstat/rstats, but I found another serious problem: the inotable's free and projected_free are:

{
    "0": {
        "data": {
            "version": 1,
            "inotable": {
                "projected_free": [
                    {
                        "start": 1099511627776,
                        "len": 1099511627776
                    }
                ],
                "free": [
                    {
                        "start": 1099511627776,
                        "len": 1099511627776
                    }
                ]
            }
        },
        "result": 0
    }
}
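
As a quick sanity check on the numbers in that dump (plain arithmetic, not Ceph code): 1099511627776 is 0x10000000000, so the single free interval covers the whole client inode range even though the recovered files already occupy inode numbers inside it; the next allocation can therefore hand out an ino that is already in use.

// Interprets the inotable dump above: one free interval covering the whole
// client inode range [0x10000000000, 0x20000000000).
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t start = 1099511627776ULL;  // "start" from the dump
  const uint64_t len   = 1099511627776ULL;  // "len" from the dump
  std::printf("free = [0x%llx, 0x%llx)\n",
              (unsigned long long)start, (unsigned long long)(start + len));
  // Prints: free = [0x10000000000, 0x20000000000)
  return 0;
}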

When I touch a file under the mountpoint, the MDS dumps core:
2017-07-07 15:45:40.327538 7f7fa12f7700 20 mds.0.sessionmap mark_projected s=0x7f7fb1ff4a80 name=client.18705 pv=4 > 5
2017-07-07 15:45:40.327540 7f7fa12f7700 10 mds.0.server prepare_new_inode prealloc [10000000001~3e8]
2017-07-07 15:45:40.327542 7f7fa12f7700 10 mds.0.server dir mode 041777 new mode 0100644
2017-07-07 15:45:40.330619 7f7fa12f7700 -1 mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f7fa12f7700 time 2017-07-07 15:45:40.327548
mds/MDCache.cc: 265: FAILED assert(inode_map.count(in->vino()) == 0)

ceph version 10.2.2-37-g75b77cc (75b77cca7e6b7f3187518acded5941ffe8f33272)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f7fa7733ee5]
2: (MDCache::add_inode(CInode*)+0x246) [0x7f7fa73fab76]
3: (Server::prepare_new_inode(std::shared_ptr<MDRequestImpl>&, CDir*, inodeno_t, unsigned int, file_layout_t*)+0x1111) [0x7f7fa7388601]
4: (Server::handle_client_openc(std::shared_ptr<MDRequestImpl>&)+0xe32) [0x7f7fa73b0a02]
5: (Server::dispatch_client_request(std::shared_ptr<MDRequestImpl>&)+0xa88) [0x7f7fa73be0d8]
6: (Server::handle_client_request(MClientRequest*)+0x48c) [0x7f7fa73be69c]
7: (Server::dispatch(Message*)+0x3eb) [0x7f7fa73c2b0b]
8: (MDSRank::handle_deferrable_message(Message*)+0x82c) [0x7f7fa733d02c]
9: (MDSRank::_dispatch(Message*, bool)+0x207) [0x7f7fa7346807]
10: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x7f7fa7347995]
11: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x7f7fa732dd03]
12: (DispatchQueue::entry()+0x78a) [0x7f7fa78336aa]
13: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f7fa7718e2d]
14: (()+0x7dc5) [0x7f7fa650ddc5]
15: (clone()+0x6d) [0x7f7fa4fda28d]

Actions #7

Updated by Ivan Guan almost 7 years ago

Zheng Yan wrote:

I fixed a cephfs-data-scan bug. https://github.com/ceph/ceph/pull/16202. It can explain this inconsistent state.

Regarding 'cephfs-journal-tool event recover_dentries summary', I think it can only replay the entries which have not yet been trimmed. In test_data_scan.py we run
'cephfs-journal-tool journal reset --force', which cleans out the whole journal, so recover_dentries can't play any role. What do you think about it?

Actions #8

Updated by Zheng Yan almost 7 years ago

You can use cephfs-table-tool to manually remove used inodes from the inotable (use "rados -p data ls" and "rados -p metadata ls" to find the largest used inode number).

I'm writing a patch for cephfs-data-scan to make it update the inotable automatically.
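
A small helper sketch for the "find the largest used ino" step above (illustrative, not part of any Ceph tool): data-pool object names are of the form "<inode number in hex>.<block offset in hex>", e.g. 10000989680.00000000, so feeding the output of "rados -p data ls" into something like this prints the highest inode number seen:

// Reads object names (one per line, e.g. from "rados -p data ls") and prints
// the largest inode number they reference. Illustrative helper, not a Ceph tool.
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <string>

int main() {
  uint64_t max_ino = 0;
  std::string name;
  while (std::getline(std::cin, name)) {
    std::string::size_type dot = name.find('.');
    if (dot == std::string::npos)
      continue;                         // not an "<ino>.<block>" style name
    uint64_t ino = std::strtoull(name.substr(0, dot).c_str(), nullptr, 16);
    if (ino > max_ino)
      max_ino = ino;
  }
  std::cout << std::hex << max_ino << std::endl;  // e.g. 10000989680
  return 0;
}

Usage would be something like "rados -p data ls | ./max_ino" (where max_ino is the compiled sketch above), and similarly for the metadata pool, giving a lower bound for the inode numbers to mark as used.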

Actions #9

Updated by Ivan Guan almost 7 years ago

Zheng Yan wrote:

You can use cephfs-table-tool to manually remove used inodes from the inotable (use "rados -p data ls" and "rados -p metadata ls" to find the largest used inode number).

I'm writing a patch for cephfs-data-scan to make it update the inotable automatically.

That would be great; it would effectively prevent ino reuse. However, I just found out that "ceph daemon mds.a scrub_path / force recursive repair" can also repair the inotable. Unfortunately, there is an
'EResetJournal' event generated by 'cephfs-journal-tool --debug-mds=20 --debug-monc=1 --debug-objecter=20 --debug-rados=20 --cluster xtao journal reset --force' which
makes the MDS skip 989680 (hex, i.e. 10,000,000) inos in test_data_scan.py. All my inos are less than 10000989680, so they do not need to be removed from free when we execute CInode::validate_disk_state.

mds.log:
2017-07-07 17:13:50.456023 7f014b647700 1 mds.0.journaler(ro) recover start
2017-07-07 17:13:50.456033 7f014b647700 1 mds.0.journaler(ro) read_head
2017-07-07 17:13:50.456208 7f014b647700 4 mds.0.log Waiting for journal 200 to recover...
...
2017-07-07 17:13:50.477537 7f0149f3e700 10 mds.0.log _replay 4194304~28 / 4194352 0.000000: EResetJournal
2017-07-07 17:13:50.477550 7f0149f3e700 1 mds.0.journal EResetJournal
2017-07-07 17:13:50.477557 7f0149f3e700 1 mds.0.sessionmap wipe start
2017-07-07 17:13:50.477561 7f0149f3e700 10 mds.0.sessionmap dump
2017-07-07 17:13:50.477564 7f0149f3e700 1 mds.0.sessionmap wipe result
2017-07-07 17:13:50.477566 7f0149f3e700 10 mds.0.sessionmap dump
2017-07-07 17:13:50.477568 7f0149f3e700 1 mds.0.sessionmap wipe done
2017-07-07 17:13:50.477574 7f0149f3e700 10 mds.0.inotable: replay_reset [10000000000~10000000000]
2017-07-07 17:13:50.477585 7f0149f3e700 10 mds.0.inotable: skip_inos was [10000000000~10000000000]
2017-07-07 17:13:50.477594 7f0149f3e700 10 mds.0.inotable: skip_inos now [10000989680~ffff676980]

....
2017-07-07 17:20:54.994468 7f014c74a700 20 mds.0.scrubstack popping [inode 1 [...2,head] / auth v2 snaprealm=0x7f01619dfa80 f(v0 1=1+0) n() (iversion lock) | request=0 lock=0 dirfrag=1 caps=0 dirtyparent=0 scrubqueue=1 dirty=0 authpin=0 0x7f0161aee5d0 mode: 17407] off of ScrubStack
2017-07-07 17:20:54.994494 7f014ae46700 20 mds.0.log _submit_thread 4219550~512 : EUpdate repair_dirfrag [metablob 10000000000, 1 dirs]
2017-07-07 17:20:54.994499 7f014ae46700 10 mds.0.journaler(rw) append_entry len 512 to 4219550~532
2017-07-07 17:20:54.994503 7f014ae46700 20 mds.0.journaler(rw) flush not delaying flush
2017-07-07 17:20:54.994505 7f014ae46700 10 mds.0.journaler(rw) _do_flush flushing 4219550~532
2017-07-07 17:20:54.994533 7f014ae46700 10 mds.0.journaler(rw) _do_flush (prezeroing/prezero)/write/flush/safe pointers now at (29360128/29360128)/4220082/4220082/4214844
2017-07-07 17:20:54.994535 7f014ae46700 20 mds.0.journaler(rw) _issue_prezero target 29360128 <= prezeroing_pos 29360128
2017-07-07 17:20:54.995968 7f014c74a700 10 MDSIOContextBase::complete: 19MDSIOContextWrapper
2017-07-07 17:20:54.995974 7f014c74a700 10 mds.0.cache.ino(1) decoded 30 bytes of backtrace successfully
2017-07-07 17:20:54.995977 7f014c74a700 10 mds.0.cache.ino(1) scrub: inotable ino = 0x1
2017-07-07 17:20:54.995979 7f014c74a700 10 mds.0.cache.ino(1) scrub: inotable free says 0

Actions #10

Updated by Zheng Yan almost 7 years ago

If inode numbers were removed from the inotable when replaying the journal, how did "assert(inode_map.count(in->vino()) == 0)" happen, and how did the client session get prealloc inos [10000000001~3e8]?

Actions #11

Updated by Ivan Guan almost 7 years ago

Zheng Yan wrote:

If inode numbers were removed from the inotable when replaying the journal, how did "assert(inode_map.count(in->vino()) == 0)" happen, and how did the client session get prealloc inos [10000000001~3e8]?

I ran the test case several times but could not reproduce 'FAILED assert(inode_map.count(in->vino()) == 0)', and the ino of the file I touched is 1099521627776 (10000989680). I think I might have run 'cephfs-journal-tool event splice --type=RESETJOURNAL summary' to delete the journal entry, or some other operation. If 'FAILED assert(inode_map.count(in->vino()) == 0)' happens again I will pay more attention to it.

Actions #12

Updated by Patrick Donnelly about 6 years ago

  • Subject changed from cephfs_datat_scan:try_remove_dentries_for_stray assertion failure to cephfs_data_scan: try_remove_dentries_for_stray assertion failure
  • Status changed from New to Closed
  • Assignee changed from Vishal Kanaujia to Zheng Yan
  • Target version deleted (v10.2.8)
  • Source set to Community (user)
  • Component(FS) tools added

Closing due to inactivity.
