Bug #19204

MDS assert failed when shutting down

Added by Sangdi Xu 9 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
03/07/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
jewel, kraken
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Release:
jewel
Component(FS):
MDS
Needs Doc:
No

Description

We encountered a failed assertion when trying to shut down an MDS. Here is a snippet of the log:

-14> 2017-01-22 14:13:46.833804 7fd210c58700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.48:6800/42546 pipe(0x558ff3803400 sd=17 :52412 s=4 pgs=227 cs=1 l=1 c=0x558ff3758900).fault (0) Success
-13> 2017-01-22 14:13:46.833802 7fd2092e6700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.12:6813/4037017 pipe(0x558ff3802000 sd=72 :32894 s=4 pgs=24 cs=1 l=1 c=0x558ffc199200).reader couldn't read tag, (0) Success
-12> 2017-01-22 14:13:46.833831 7fd2092e6700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.12:6813/4037017 pipe(0x558ff3802000 sd=72 :32894 s=4 pgs=24 cs=1 l=1 c=0x558ffc199200).fault (0) Success
-11> 2017-01-22 14:13:46.833884 7fd213861700 5 asok(0x558ff373a000) unregister_command objecter_requests
-10> 2017-01-22 14:13:46.833896 7fd213861700 10 monclient: shutdown
-9> 2017-01-22 14:13:46.833901 7fd213861700 1 -- 192.168.36.11:6801/2188363 mark_down 0x558ffc198600 -- 0x558ffa52e000
-8> 2017-01-22 14:13:46.833922 7fd2080d4700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.12:6819/4037213 pipe(0x558ff500e800 sd=82 :32834 s=4 pgs=25 cs=1 l=1 c=0x558ffc19bc00).reader couldn't read tag, (0) Success
-7> 2017-01-22 14:13:46.833943 7fd2080d4700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.12:6819/4037213 pipe(0x558ff500e800 sd=82 :32834 s=4 pgs=25 cs=1 l=1 c=0x558ffc19bc00).fault (0) Success
-6> 2017-01-22 14:13:46.833937 7fd214964700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.11:6789/0 pipe(0x558ffa52e000 sd=8 :52298 s=4 pgs=31815 cs=1 l=1 c=0x558ffc198600).reader couldn't read tag, (0) Success
-5> 2017-01-22 14:13:46.833954 7fd214964700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.11:6789/0 pipe(0x558ffa52e000 sd=8 :52298 s=4 pgs=31815 cs=1 l=1 c=0x558ffc198600).fault (0) Success
-4> 2017-01-22 14:13:46.833959 7fd210b57700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.11:6800/678824 pipe(0x558ff3804800 sd=18 :45286 s=4 pgs=198 cs=1 l=1 c=0x558ff3758c00).reader couldn't read tag, (0) Success
-3> 2017-01-22 14:13:46.833972 7fd210b57700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.11:6800/678824 pipe(0x558ff3804800 sd=18 :45286 s=4 pgs=198 cs=1 l=1 c=0x558ff3758c00).fault (0) Success
-2> 2017-01-22 14:13:46.834029 7fd20e437700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.48:6804/42771 pipe(0x558ff5034000 sd=33 :35778 s=4 pgs=300 cs=1 l=1 c=0x558ff375ba80).reader couldn't read tag, (0) Success
-1> 2017-01-22 14:13:46.834062 7fd20e437700 2 -- 192.168.36.11:6801/2188363 >> 192.168.36.48:6804/42771 pipe(0x558ff5034000 sd=33 :35778 s=4 pgs=300 cs=1 l=1 c=0x558ff375ba80).fault (0) Success
0> 2017-01-22 14:13:46.836775 7fd21285f700 -1 osdc/Objecter.cc: In function 'void Objecter::_op_submit_with_budget(Objecter::Op*, Objecter::shunique_lock&, ceph_tid_t*, int*)' thread 7fd21285f700 time 2017-01-22 14:13:46.834106
osdc/Objecter.cc: 2145: FAILED assert(initialized.read())

ceph version 10.2.5 (53ded15a3fab78780028baa5681f578254e2b9df)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x88) [0x558fe7a1ca18]
2: (Objecter::_op_submit_with_budget(Objecter::Op*, ceph::shunique_lock<boost::shared_mutex>&, unsigned long*, int*)+0x3ad) [0x558fe78b068d]
3: (Objecter::op_submit(Objecter::Op*, unsigned long*, int*)+0x6e) [0x558fe78b07ae]
4: (Filer::_probe(Filer::Probe*, std::unique_lock<std::mutex>&)+0xbea) [0x558fe788524a]
5: (Filer::_probed(Filer::Probe*, object_t const&, unsigned long, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::unique_lock<std::mutex>&)+0x9bb) [0x558fe788671b]
6: (Filer::C_Probe::finish(int)+0x6c) [0x558fe7888dac]
7: (Context::complete(int)+0x9) [0x558fe7606be9]
8: (Finisher::finisher_thread_entry()+0x4c5) [0x558fe793e305]
9: (()+0x8182) [0x7fd21d371182]
10: (clone()+0x6d) [0x7fd21b8ba47d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

It seems to be caused by an improper shutdown order of the MDS subsystems: the Finisher was still trying to use the Objecter after the Objecter had already been shut down.


Related issues

Copied to fs - Backport #19671: jewel: MDS assert failed when shutting down Resolved
Copied to fs - Backport #19672: kraken: MDS assert failed when shutting down Resolved

History

#1 Updated by John Spray 9 months ago

  • Status changed from New to Need Review

Hmm, we do shut down the objecter before the finisher, which is clearly not handling this case.

Let's try swapping the order; there may be subtle issues there too, though.

#3 Updated by John Spray 7 months ago

  • Status changed from Need Review to Pending Backport
  • Backport set to jewel, kraken

#4 Updated by Nathan Cutler 7 months ago

  • Copied to Backport #19671: jewel: MDS assert failed when shutting down added

#5 Updated by Nathan Cutler 7 months ago

  • Copied to Backport #19672: kraken: MDS assert failed when shutting down added

#6 Updated by Nathan Cutler 4 months ago

  • Status changed from Pending Backport to Resolved
