Bug #62861

closed

mds: _submit_entry ELid(0) crashed the MDS

Added by Xiubo Li 8 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/teuthology/pdonnell-2023-09-12_14:07:50-fs-wip-batrick-testing-20230912.122437-distro-default-smithi/7395153/teuthology.log

   -14> 2023-09-12T17:28:18.890+0000 7ff336960700 14 mds.4.cache remove_inode [inode 0x1 [...215,head] / rep@0.1 v488 snaprealm=0x55adcf116900 f(v0 m2023-09-12T17:02:54.866508+0000 1=0+1) n(v55 rc2023-09-12T17:28:06.261956+0000 b752586534 rs27 1216=1108+108)/n(v0 rc2023-09-12T17:02:14.519913+0000 1=0+1) old_inodes=74 (inest mix) (iversion lock) 0x55adcf11f600]
   -13> 2023-09-12T17:28:18.890+0000 7ff336960700 15 mds.4.cache.ino(0x1) close_snaprealm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} last_modified 2023-09-12T17:02:14.519913+0000 change_attr 0 0x55adcf116900)
   -12> 2023-09-12T17:28:18.891+0000 7ff336960700  7 mds.4.cache sending cache_expire to 0
   -11> 2023-09-12T17:28:18.891+0000 7ff336960700  1 -- [v2:172.21.15.161:6836/2827841538,v1:172.21.15.161:6839/2827841538] send_to--> mds [v2:172.21.15.175:6832/3101275411,v1:172.21.15.175:6834/3101275411] -- cache_expire magic: 0 v1 -- ?+0 0x55adcf1e1c80
   -10> 2023-09-12T17:28:18.891+0000 7ff336960700  1 -- [v2:172.21.15.161:6836/2827841538,v1:172.21.15.161:6839/2827841538] --> [v2:172.21.15.175:6832/3101275411,v1:172.21.15.175:6834/3101275411] -- cache_expire magic: 0 v1 -- 0x55adcf1e1c80 con 0x55adcf13b800
    -9> 2023-09-12T17:28:18.891+0000 7ff336960700  5 mds.4.cache lru size now 0/0
    -8> 2023-09-12T17:28:18.891+0000 7ff336960700  7 mds.4.cache looking for subtrees to export
    -7> 2023-09-12T17:28:18.891+0000 7ff336960700 10 mds.4.cache   examining [dir 0x104 ~mds4/ [2,head] auth v=9249 cv=9249/9249 dir_auth=4 state=1073741824 f(v0 10=0+10) n(v51 rc2023-09-12T17:27:53.889216+0000 b315435 47=37+10) hs=0+0,ss=0+0 | child=0 subtree=1 subtreetemp=0 replicated=0 dirty=0 waiter=0 authpin=0 0x55adce1f6400] bounds
    -6> 2023-09-12T17:28:18.891+0000 7ff336960700 20 mds.4.bal handle_export_pins export_pin_queue size=0
    -5> 2023-09-12T17:28:18.891+0000 7ff336960700 10 mds.4.log trim_all: 1/0/0
    -4> 2023-09-12T17:28:18.891+0000 7ff336960700 20 mds.4.log _trim_expired_segments: examining LogSegment(15246/0x36f8870 events=1)
    -3> 2023-09-12T17:28:18.891+0000 7ff336960700 10 mds.4.log _trim_expired_segments waiting for expiry LogSegment(15246/0x36f8870 events=1)
    -2> 2023-09-12T17:28:18.891+0000 7ff336960700  7 mds.4.cache capping the mdlog
    -1> 2023-09-12T17:28:18.891+0000 7ff336960700 20 mds.4.log _submit_entry ELid(0)
     0> 2023-09-12T17:28:18.892+0000 7ff336960700 -1 *** Caught signal (Segmentation fault) **
 in thread 7ff336960700 thread_name:safe_timer

 ceph version 18.0.0-6088-g2110e007 (2110e00747b31a0d73768c9e5229da18b26b8aa0) reef (dev)
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7ff3427cece0]
 2: (CInode::get_dirfrags() const+0x26) [0x55adcc11f106]
 3: (MDCache::advance_stray()+0x1f0) [0x55adcc0908b0]
 4: (MDLog::_start_new_segment(SegmentBoundary*)+0x465) [0x55adcc2d4975]
 5: (MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0xba) [0x55adcc2d4b0a]
 6: (MDLog::submit_entry(LogEvent*, MDSLogContextBase*)+0xbf) [0x55adcbfac33f]
 7: (MDCache::shutdown_pass()+0xe9f) [0x55adcc0ea30f]
 8: (MDSRankDispatcher::tick()+0x300) [0x55adcbf51d60]
 9: (Context::complete(int)+0xd) [0x55adcbf272cd]
 10: (CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x181) [0x7ff343b22af1]
 11: (CommonSafeTimerThread<ceph::fair_mutex>::entry()+0x11) [0x7ff343b23e01]
 12: /lib64/libpthread.so.0(+0x81cf) [0x7ff3427c41cf]
 13: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
  20/20 mds
  20/20 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer

Actions #1

Updated by Xiubo Li 8 months ago

  • Status changed from New to Fix Under Review
  • Assignee set to Xiubo Li
  • Pull request ID set to 53494

Actions #2

Updated by Xiubo Li 8 months ago

It's a use-after-free bug for the stray CInodes.
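
For readers unfamiliar with this class of bug, here is a minimal toy sketch of the pattern the backtrace suggests: a table of raw stray pointers is consulted (as `MDCache::advance_stray()` does via `CInode::get_dirfrags()`) after the cache has already removed inodes during `shutdown_pass()`. This is not the Ceph code and not the change in pull request 53494. `ToyCache`, `ToyInode`, `add_stray`, `stray` and the inode numbers are all hypothetical names chosen for illustration, and clearing the slot on removal is just one generic way to avoid a dangling pointer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>

// Hypothetical stand-in for an inode; it does not mirror CInode.
struct ToyInode {
  explicit ToyInode(std::uint64_t i) : ino(i) {}
  std::uint64_t ino;
  int touched = 0;  // counts how often advance_stray() used this inode
};

// Hypothetical stand-in for a cache that owns inodes and also keeps
// a separate table of raw pointers to its stray inodes.
class ToyCache {
public:
  static constexpr std::size_t kNumStray = 3;

  // Create an inode owned by the cache and record it in a stray slot.
  ToyInode* add_stray(std::size_t slot, std::uint64_t ino) {
    assert(slot < kNumStray);
    remove_inode(ino);  // drop any earlier inode with the same number
    auto& owned = inodes_[ino];
    owned = std::make_unique<ToyInode>(ino);
    strays_[slot] = owned.get();
    return strays_[slot];
  }

  // Free an inode. Without the loop below, strays_ would keep a
  // dangling pointer and a later advance_stray() would be a
  // use-after-free.
  bool remove_inode(std::uint64_t ino) {
    auto it = inodes_.find(ino);
    if (it == inodes_.end()) return false;
    for (auto& slot : strays_)
      if (slot == it->second.get()) slot = nullptr;
    inodes_.erase(it);
    return true;
  }

  // Move to the next stray slot and touch its inode if it still exists.
  // Returns false when the slot has been cleared.
  bool advance_stray() {
    current_ = (current_ + 1) % kNumStray;
    ToyInode* in = strays_[current_];
    if (in == nullptr) return false;
    ++in->touched;
    return true;
  }

  const ToyInode* stray(std::size_t slot) const { return strays_.at(slot); }

private:
  std::map<std::uint64_t, std::unique_ptr<ToyInode>> inodes_;
  std::array<ToyInode*, kNumStray> strays_{};  // raw, non-owning pointers
  std::size_t current_ = 0;
};
```

In this sketch, removing the inode in slot 1 and then calling `advance_stray()` returns `false` instead of dereferencing freed memory. Deleting the slot-clearing loop in `remove_inode` reproduces the dangling-pointer situation in miniature.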

Actions #3

Updated by Patrick Donnelly 8 months ago

  • Category set to Correctness/Safety
  • Target version set to v19.0.0
  • Source set to Q/A
  • Labels (FS) qa-failure added
Actions #4

Updated by Venky Shankar 7 months ago

  • Status changed from Fix Under Review to Resolved