Bug #56592 (open): mds: crash when mounting a client during the scrub repair is going on

Added by Xiubo Li almost 2 years ago. Updated 7 months ago.

Status: Triaged
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version: -
% Done: 0%
Source:
Tags:
Backport: quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash, scrub
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

   -18> 2022-07-18T13:43:18.426+0800 14b9e13ee700  2 mds.0.cache Memory usage:  total 242512, rss 96724, heap 57612, baseline 47372, 0 / 14 inodes have caps, 0 caps, 0 caps per inode
   -17> 2022-07-18T13:43:18.678+0800 14b9e23f6700  5 mds.c ms_handle_reset on v1:10.72.47.117:0/707352342
   -16> 2022-07-18T13:43:18.678+0800 14b9e23f6700  3 mds.c ms_handle_reset closing connection for session client.5694 v1:10.72.47.117:0/707352342
   -15> 2022-07-18T13:43:18.728+0800 14b9e2ffc700  1 mds.c asok_command: scrub start {path=/,prefix=scrub start,scrubops=[recursive,repair]} (starting...)
   -14> 2022-07-18T13:43:18.729+0800 14b9e0bea700  1 lockdep reusing last freed id 89
   -13> 2022-07-18T13:43:18.729+0800 14b9e0bea700  0 log_channel(cluster) log [INF] : scrub queued for path: /
   -12> 2022-07-18T13:43:18.729+0800 14b9e0bea700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [/]
   -11> 2022-07-18T13:43:18.729+0800 14b9e0bea700  0 log_channel(cluster) log [INF] : scrub summary: active paths [/]
   -10> 2022-07-18T13:43:18.730+0800 14b9e0bea700  0 log_channel(cluster) log [WRN] : bad backtrace on inode 0x10000000000(/mydir), rewriting it
    -9> 2022-07-18T13:43:18.730+0800 14b9e0bea700  0 log_channel(cluster) log [INF] : Scrub repaired inode 0x10000000000 (/mydir)
    -8> 2022-07-18T13:43:18.730+0800 14b9e0bea700 -1 mds.0.scrubstack _validate_inode_done scrub error on inode [inode 0x10000000000 [...2,head] /mydir/ auth v4 pv7 ap=3 DIRTYPARENT f(v1 1=1+0) n() (inest lock dirty) (ifile lock->sync w=1 flushing) (iversion lock) | dirtyscattered=2 lock=1 dirfrag=1 dirtyrstat=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 scrubqueue=0 0x55dc6ca32680]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":-61,"ondisk_value":"(-1)0x0:[]//[]","memoryvalue":"(4)0x10000000000:[<0x1/mydir v4>]//[]","error_str":"failed to read off disk; see retval"},"raw_stats":{"checked":true,"passed":true,"read_ret_val":0,"ondisk_value.dirstat":"f(v0 1=1+0)","ondisk_value.rstat":"n(v0 1=0+1)","memory_value.dirstat":"f(v1 1=1+0)","memory_value.rstat":"n()","error_str":"freshly-calculated rstats don't match existing ones (will be fixed)"},"return_code":-61}
    -7> 2022-07-18T13:43:18.731+0800 14b9e07e8700  5 mds.0.log _submit_thread 4200668~1381 : EUpdate scatter_writebehind [metablob 0x1, 2 dirs]
    -6> 2022-07-18T13:43:18.731+0800 14b9e07e8700  5 mds.0.log _submit_thread 4202069~872 : ESubtreeMap 2 subtrees , 0 ambiguous [metablob 0x1, 2 dirs]
    -5> 2022-07-18T13:43:18.731+0800 14b9e23f6700  5 mds.c ms_handle_reset on v1:10.72.47.117:0/2723416997
    -4> 2022-07-18T13:43:18.731+0800 14b9e23f6700  3 mds.c ms_handle_reset closing connection for session client.5650 v1:10.72.47.117:0/2723416997
    -3> 2022-07-18T13:43:18.731+0800 14b9e11ed700  1 mds.0.cache.dir(0x10000000000) mismatch between head items and fnode.fragstat! printing dentries
    -2> 2022-07-18T13:43:18.731+0800 14b9e11ed700  1 mds.0.cache.dir(0x10000000000) get_num_head_items() = 2; fnode.fragstat.nfiles=1 fnode.fragstat.nsubdirs=0
    -1> 2022-07-18T13:43:18.736+0800 14b9e11ed700 -1 /data/ceph/src/mds/ScrubStack.cc: In function 'void ScrubStack::dequeue(MDSCacheObject*)' thread 14b9e11ed700 time 2022-07-18T13:43:18.733038+0800
/data/ceph/src/mds/ScrubStack.cc: 57: FAILED ceph_assert(obj->item_scrub.is_on_list())

 ceph version 17.0.0-13587-gdcc92e07b25 (dcc92e07b2557170293e55675763614717c12d98) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x14b9ea40c207]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x14b9ea40c439]
 3: (ScrubStack::dequeue(MDSCacheObject*)+0x1cc) [0x55dc6b140cee]
 4: (ScrubStack::kick_off_scrubs()+0x68b) [0x55dc6b136cd5]
 5: (ScrubStack::remove_from_waiting(MDSCacheObject*, bool)+0xbc) [0x55dc6b137280]
 6: (C_RetryScrub::finish(int)+0x19) [0x55dc6b140d79]
 7: (MDSContext::complete(int)+0x6dc) [0x55dc6b1740b6]
 8: (MDSRank::_advance_queues()+0x386) [0x55dc6ae0c460]
 9: (MDSRank::ProgressThread::entry()+0xe19) [0x55dc6ae0d63f]
 10: (Thread::entry_wrapper()+0x3f) [0x14b9ea3dec29]
 11: (Thread::_entry_func(void*)+0x9) [0x14b9ea3dec41]
 12: /lib64/libpthread.so.0(+0x817a) [0x14b9e860817a]
 13: clone()

     0> 2022-07-18T13:43:18.741+0800 14b9e11ed700 -1 *** Caught signal (Aborted) **
 in thread 14b9e11ed700 thread_name:mds_rank_progr
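
The immediate abort is the membership assertion in ScrubStack::dequeue(). Below is a minimal, self-contained C++ sketch of that pattern (the container and field names are simplified stand-ins, not the real xlist/ScrubStack implementation): dequeue() requires the object to still be linked on the scrub stack's intrusive list, so reaching dequeue() for an object that is no longer (or never was) queued trips the assert and aborts the daemon.

// Illustrative sketch only -- NOT Ceph's actual xlist/ScrubStack code.
#include <cassert>
#include <list>

struct ScrubItem {
  bool linked = false;                     // stand-in for xlist<...>::item state
  bool is_on_list() const { return linked; }
};

struct CacheObject {
  ScrubItem item_scrub;                    // mirrors the item_scrub member named in the assert
};

struct Stack {
  std::list<CacheObject *> scrub_stack;

  void enqueue(CacheObject *obj) {
    scrub_stack.push_back(obj);
    obj->item_scrub.linked = true;
  }

  void dequeue(CacheObject *obj) {
    // Equivalent of FAILED ceph_assert(obj->item_scrub.is_on_list()):
    // the object must still be queued when we take it off the stack.
    assert(obj->item_scrub.is_on_list());
    scrub_stack.remove(obj);
    obj->item_scrub.linked = false;
  }
};

int main() {
  Stack stack;
  CacheObject dir;                         // think of it as the /mydir inode

  stack.enqueue(&dir);
  stack.dequeue(&dir);                     // normal completion path
  stack.dequeue(&dir);                     // a second, racing dequeue: assert fires
  return 0;
}

In the crash above, the second dequeue arrives via the retry path (C_RetryScrub::finish -> remove_from_waiting -> kick_off_scrubs) while the repair of 0x10000000000 is still in flight, which is consistent with the assert firing on an object that has already been unlinked.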
#1 - Updated by Venky Shankar over 1 year ago

Xiubo,

Were you trying to mount /mydir when it was getting repaired?

#2 - Updated by Xiubo Li over 1 year ago

Venky Shankar wrote:

Xiubo,

Were you trying to mount /mydir when it was getting repaired?

No, I was just trying to mount the / directory.

#3 - Updated by Xiubo Li over 1 year ago

More info:

I was simulating the customer case we hit: I removed one of the directory's objects from the metadata pool, ran the scrub command to repair it, and tried to mount the filesystem while the repair was still in progress.

#4 - Updated by Venky Shankar over 1 year ago

  • Category set to Correctness/Safety
  • Status changed from New to Triaged
  • Assignee set to Xiubo Li
  • Target version set to v18.0.0
#5 - Updated by Patrick Donnelly 7 months ago

  • Target version deleted (v18.0.0)