Bug #56592 (open): mds: crash when mounting a client during the scrub repair is going on

Added by Xiubo Li almost 2 years ago. Updated 7 months ago.

Status: Triaged
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version: -
% Done: 0%
Source:
Tags:
Backport: quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash, scrub
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

   -18> 2022-07-18T13:43:18.426+0800 14b9e13ee700  2 mds.0.cache Memory usage:  total 242512, rss 96724, heap 57612, baseline 47372, 0 / 14 inodes have caps, 0 caps, 0 caps per inode
   -17> 2022-07-18T13:43:18.678+0800 14b9e23f6700  5 mds.c ms_handle_reset on v1:10.72.47.117:0/707352342
   -16> 2022-07-18T13:43:18.678+0800 14b9e23f6700  3 mds.c ms_handle_reset closing connection for session client.5694 v1:10.72.47.117:0/707352342
   -15> 2022-07-18T13:43:18.728+0800 14b9e2ffc700  1 mds.c asok_command: scrub start {path=/,prefix=scrub start,scrubops=[recursive,repair]} (starting...)
   -14> 2022-07-18T13:43:18.729+0800 14b9e0bea700  1 lockdep reusing last freed id 89
   -13> 2022-07-18T13:43:18.729+0800 14b9e0bea700  0 log_channel(cluster) log [INF] : scrub queued for path: /
   -12> 2022-07-18T13:43:18.729+0800 14b9e0bea700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [/]
   -11> 2022-07-18T13:43:18.729+0800 14b9e0bea700  0 log_channel(cluster) log [INF] : scrub summary: active paths [/]
   -10> 2022-07-18T13:43:18.730+0800 14b9e0bea700  0 log_channel(cluster) log [WRN] : bad backtrace on inode 0x10000000000(/mydir), rewriting it
    -9> 2022-07-18T13:43:18.730+0800 14b9e0bea700  0 log_channel(cluster) log [INF] : Scrub repaired inode 0x10000000000 (/mydir)
    -8> 2022-07-18T13:43:18.730+0800 14b9e0bea700 -1 mds.0.scrubstack _validate_inode_done scrub error on inode [inode 0x10000000000 [...2,head] /mydir/ auth v4 pv7 ap=3 DIRTYPARENT f(v1 1=1+0) n() (inest lock dirty) (ifile lock->sync w=1 flushing) (iversion lock) | dirtyscattered=2 lock=1 dirfrag=1 dirtyrstat=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 scrubqueue=0 0x55dc6ca32680]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":-61,"ondisk_value":"(-1)0x0:[]//[]","memoryvalue":"(4)0x10000000000:[<0x1/mydir v4>]//[]","error_str":"failed to read off disk; see retval"},"raw_stats":{"checked":true,"passed":true,"read_ret_val":0,"ondisk_value.dirstat":"f(v0 1=1+0)","ondisk_value.rstat":"n(v0 1=0+1)","memory_value.dirstat":"f(v1 1=1+0)","memory_value.rstat":"n()","error_str":"freshly-calculated rstats don't match existing ones (will be fixed)"},"return_code":-61}
    -7> 2022-07-18T13:43:18.731+0800 14b9e07e8700  5 mds.0.log _submit_thread 4200668~1381 : EUpdate scatter_writebehind [metablob 0x1, 2 dirs]
    -6> 2022-07-18T13:43:18.731+0800 14b9e07e8700  5 mds.0.log _submit_thread 4202069~872 : ESubtreeMap 2 subtrees , 0 ambiguous [metablob 0x1, 2 dirs]
    -5> 2022-07-18T13:43:18.731+0800 14b9e23f6700  5 mds.c ms_handle_reset on v1:10.72.47.117:0/2723416997
    -4> 2022-07-18T13:43:18.731+0800 14b9e23f6700  3 mds.c ms_handle_reset closing connection for session client.5650 v1:10.72.47.117:0/2723416997
    -3> 2022-07-18T13:43:18.731+0800 14b9e11ed700  1 mds.0.cache.dir(0x10000000000) mismatch between head items and fnode.fragstat! printing dentries
    -2> 2022-07-18T13:43:18.731+0800 14b9e11ed700  1 mds.0.cache.dir(0x10000000000) get_num_head_items() = 2; fnode.fragstat.nfiles=1 fnode.fragstat.nsubdirs=0
    -1> 2022-07-18T13:43:18.736+0800 14b9e11ed700 -1 /data/ceph/src/mds/ScrubStack.cc: In function 'void ScrubStack::dequeue(MDSCacheObject*)' thread 14b9e11ed700 time 2022-07-18T13:43:18.733038+0800
/data/ceph/src/mds/ScrubStack.cc: 57: FAILED ceph_assert(obj->item_scrub.is_on_list())

 ceph version 17.0.0-13587-gdcc92e07b25 (dcc92e07b2557170293e55675763614717c12d98) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x14b9ea40c207]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x14b9ea40c439]
 3: (ScrubStack::dequeue(MDSCacheObject*)+0x1cc) [0x55dc6b140cee]
 4: (ScrubStack::kick_off_scrubs()+0x68b) [0x55dc6b136cd5]
 5: (ScrubStack::remove_from_waiting(MDSCacheObject*, bool)+0xbc) [0x55dc6b137280]
 6: (C_RetryScrub::finish(int)+0x19) [0x55dc6b140d79]
 7: (MDSContext::complete(int)+0x6dc) [0x55dc6b1740b6]
 8: (MDSRank::_advance_queues()+0x386) [0x55dc6ae0c460]
 9: (MDSRank::ProgressThread::entry()+0xe19) [0x55dc6ae0d63f]
 10: (Thread::entry_wrapper()+0x3f) [0x14b9ea3dec29]
 11: (Thread::_entry_func(void*)+0x9) [0x14b9ea3dec41]
 12: /lib64/libpthread.so.0(+0x817a) [0x14b9e860817a]
 13: clone()

     0> 2022-07-18T13:43:18.741+0800 14b9e11ed700 -1 *** Caught signal (Aborted) **
 in thread 14b9e11ed700 thread_name:mds_rank_progr
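
The immediate abort is the membership assertion in ScrubStack::dequeue(). Below is a minimal, self-contained C++ sketch of that pattern (the container and field names are simplified stand-ins, not the real xlist/ScrubStack implementation): dequeue() requires the object to still be linked on the scrub stack's intrusive list, so reaching dequeue() for an object that is no longer (or never was) queued trips the assert and aborts the daemon.

// Illustrative sketch only -- NOT Ceph's actual xlist/ScrubStack code.
#include <cassert>
#include <list>

struct ScrubItem {
  bool linked = false;                     // stand-in for xlist<...>::item state
  bool is_on_list() const { return linked; }
};

struct CacheObject {
  ScrubItem item_scrub;                    // mirrors the item_scrub member named in the assert
};

struct Stack {
  std::list<CacheObject *> scrub_stack;

  void enqueue(CacheObject *obj) {
    scrub_stack.push_back(obj);
    obj->item_scrub.linked = true;
  }

  void dequeue(CacheObject *obj) {
    // Equivalent of FAILED ceph_assert(obj->item_scrub.is_on_list()):
    // the object must still be queued when we take it off the stack.
    assert(obj->item_scrub.is_on_list());
    scrub_stack.remove(obj);
    obj->item_scrub.linked = false;
  }
};

int main() {
  Stack stack;
  CacheObject dir;                         // think of it as the /mydir inode

  stack.enqueue(&dir);
  stack.dequeue(&dir);                     // normal completion path
  stack.dequeue(&dir);                     // a second, racing dequeue: assert fires
  return 0;
}

In the crash above, the second dequeue arrives via the retry path (C_RetryScrub::finish -> remove_from_waiting -> kick_off_scrubs) while the repair of 0x10000000000 is still in flight, which is consistent with the assert firing on an object that has already been unlinked.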
#1 - Updated by Venky Shankar over 1 year ago

Xiubo,

Were you trying to mount /mydir when it was getting repaired?

#2 - Updated by Xiubo Li over 1 year ago

Venky Shankar wrote:

Xiubo,

Were you trying to mount /mydir when it was getting repaired?

No, I was just trying to mount the / directory.

#3 - Updated by Xiubo Li over 1 year ago

More info:

I was simulating the customer case we hit: I removed one of the directory's objects from the metadata pool, ran the scrub command to repair it, and tried to mount the filesystem while the repair was still in progress.

#4 - Updated by Venky Shankar over 1 year ago

  • Category set to Correctness/Safety
  • Status changed from New to Triaged
  • Assignee set to Xiubo Li
  • Target version set to v18.0.0
#5 - Updated by Patrick Donnelly 7 months ago

  • Target version deleted (v18.0.0)