Bug #19204

closed

MDS assert failed when shutting down

Added by Sandy Xu about 7 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
jewel, kraken
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We encountered a failed assertion when trying to shut down an MDS. Here is a snippet of the log:

-14> 2017-01-22 14:13:46.833804 7fd210c58700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.48:6800/42546 pipe(0x558ff3803400 sd=17 :52412 s=4 pgs=227 cs=1 l=1 c=0x558ff3758900).fault (0) Success
-13> 2017-01-22 14:13:46.833802 7fd2092e6700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.12:6813/4037017 pipe(0x558ff3802000 sd=72 :32894 s=4 pgs=24 cs=1 l=1 c=0x558ffc199200).reader couldn't read tag, (0) Success
-12> 2017-01-22 14:13:46.833831 7fd2092e6700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.12:6813/4037017 pipe(0x558ff3802000 sd=72 :32894 s=4 pgs=24 cs=1 l=1 c=0x558ffc199200).fault (0) Success
-11> 2017-01-22 14:13:46.833884 7fd213861700 5 asok(0x558ff373a000) unregister_command objecter_requests
-10> 2017-01-22 14:13:46.833896 7fd213861700 10 monclient: shutdown
-9> 2017-01-22 14:13:46.833901 7fd213861700 1 -- 192.168.36.11:6801/2188363 mark_down 0x558ffc198600 -- 0x558ffa52e000
-8> 2017-01-22 14:13:46.833922 7fd2080d4700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.12:6819/4037213 pipe(0x558ff500e800 sd=82 :32834 s=4 pgs=25 cs=1 l=1 c=0x558ffc19bc00).reader couldn't read tag, (0) Success
-7> 2017-01-22 14:13:46.833943 7fd2080d4700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.12:6819/4037213 pipe(0x558ff500e800 sd=82 :32834 s=4 pgs=25 cs=1 l=1 c=0x558ffc19bc00).fault (0) Success
-6> 2017-01-22 14:13:46.833937 7fd214964700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.11:6789/0 pipe(0x558ffa52e000 sd=8 :52298 s=4 pgs=31815 cs=1 l=1 c=0x558ffc198600).reader couldn't read tag, (0) Success
-5> 2017-01-22 14:13:46.833954 7fd214964700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.11:6789/0 pipe(0x558ffa52e000 sd=8 :52298 s=4 pgs=31815 cs=1 l=1 c=0x558ffc198600).fault (0) Success
-4> 2017-01-22 14:13:46.833959 7fd210b57700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.11:6800/678824 pipe(0x558ff3804800 sd=18 :45286 s=4 pgs=198 cs=1 l=1 c=0x558ff3758c00).reader couldn't read tag, (0) Success
-3> 2017-01-22 14:13:46.833972 7fd210b57700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.11:6800/678824 pipe(0x558ff3804800 sd=18 :45286 s=4 pgs=198 cs=1 l=1 c=0x558ff3758c00).fault (0) Success
-2> 2017-01-22 14:13:46.834029 7fd20e437700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.48:6804/42771 pipe(0x558ff5034000 sd=33 :35778 s=4 pgs=300 cs=1 l=1 c=0x558ff375ba80).reader couldn't read tag, (0) Success
-1> 2017-01-22 14:13:46.834062 7fd20e437700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.48:6804/42771 pipe(0x558ff5034000 sd=33 :35778 s=4 pgs=300 cs=1 l=1 c=0x558ff375ba80).fault (0) Success
0> 2017-01-22 14:13:46.836775 7fd21285f700 -1 osdc/Objecter.cc: In function 'void Objecter::_op_submit_with_budget(Objecter::Op*, Objecter::shunique_lock&, ceph_tid_t*, int*)' thread 7fd21285f700 time 2017-01-22 14:13:46.834106
osdc/Objecter.cc: 2145: FAILED assert(initialized.read())

ceph version 10.2.5 (53ded15a3fab78780028baa5681f578254e2b9df)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x88) [0x558fe7a1ca18]
2: (Objecter::_op_submit_with_budget(Objecter::Op*, ceph::shunique_lock<boost::shared_mutex>&, unsigned long*, int*)+0x3ad) [0x558fe78b068d]
3: (Objecter::op_submit(Objecter::Op*, unsigned long*, int*)+0x6e) [0x558fe78b07ae]
4: (Filer::_probe(Filer::Probe*, std::unique_lock<std::mutex>&)+0xbea) [0x558fe788524a]
5: (Filer::_probed(Filer::Probe*, object_t const&, unsigned long, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::unique_lock<std::mutex>&)+0x9bb) [0x558fe788671b]
6: (Filer::C_Probe::finish(int)+0x6c) [0x558fe7888dac]
7: (Context::complete(int)+0x9) [0x558fe7606be9]
8: (Finisher::finisher_thread_entry()+0x4c5) [0x558fe793e305]
9: (()+0x8182) [0x7fd21d371182]
10: (clone()+0x6d) [0x7fd21b8ba47d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

It appears to be caused by an incorrect shutdown order of MDS subsystems: the Finisher was still trying to use the Objecter after the Objecter had already been shut down.
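For illustration only, here is a minimal standalone C++ sketch of that ordering problem. The names (ToyObjecter, ToyFinisher, op_submit) are hypothetical stand-ins, not the actual Ceph classes: a completion queued on the finisher calls back into the objecter after the objecter has been marked uninitialized, which is the condition the assert above guards against.

#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

struct ToyObjecter {
  std::atomic<bool> initialized{true};
  void op_submit() { assert(initialized.load()); }   // stands in for the real op-submit path
  void shutdown()  { initialized = false; }
};

struct ToyFinisher {
  std::mutex m;
  std::condition_variable cv;
  std::queue<std::function<void()>> q;
  bool stopping = false;
  std::thread t{[this] { run(); }};

  void queue(std::function<void()> fn) {
    std::lock_guard<std::mutex> l(m);
    q.push(std::move(fn));
    cv.notify_one();
  }
  void run() {
    std::unique_lock<std::mutex> l(m);
    while (!stopping || !q.empty()) {
      if (q.empty()) { cv.wait(l); continue; }
      auto fn = std::move(q.front());
      q.pop();
      l.unlock();
      fn();          // a completion may call back into the objecter
      l.lock();
    }
  }
  void stop() {
    { std::lock_guard<std::mutex> l(m); stopping = true; }
    cv.notify_one();
    t.join();        // drains whatever is still queued before returning
  }
};

int main() {
  ToyObjecter objecter;
  ToyFinisher finisher;

  // Teardown in the order implied by the backtrace: the objecter is
  // invalidated first, then the finisher drains a completion (think of
  // Filer::C_Probe::finish) that still wants to submit an op.
  objecter.shutdown();
  finisher.queue([&] { objecter.op_submit(); });
  finisher.stop();   // the queued completion now trips the initialized check,
                     // mirroring FAILED assert(initialized.read()) above
}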


Related issues (2): 0 open, 2 closed

Copied to CephFS - Backport #19671: jewel: MDS assert failed when shutting down (Resolved, Nathan Cutler)
Copied to CephFS - Backport #19672: kraken: MDS assert failed when shutting down (Resolved, Nathan Cutler)
#1

Updated by John Spray about 7 years ago

  • Status changed from New to Fix Under Review

Hmm, we do shut down the objecter before the finisher, which is clearly not handling this case.

Let's try swapping the order; there may be subtle issues there too, though.
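Illustrating the proposed swap with the same toy sketch from the description (again hypothetical names, not the actual MDS teardown code): draining the finisher first lets any pending completions run while the objecter is still initialized.

// Continuing the illustrative sketch above (hypothetical names only):
// drain pending completions first, then tear down the objecter they call into.
finisher.stop();      // queued completions still see objecter.initialized == true
objecter.shutdown();  // afterwards nothing is left to call back into the objecter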

#3

Updated by John Spray about 7 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to jewel, kraken
#4

Updated by Nathan Cutler about 7 years ago

  • Copied to Backport #19671: jewel: MDS assert failed when shutting down added
#5

Updated by Nathan Cutler about 7 years ago

  • Copied to Backport #19672: kraken: MDS assert failed when shutting down added
#6

Updated by Nathan Cutler almost 7 years ago

  • Status changed from Pending Backport to Resolved