Bug #17670
multimds: mds entering up:replay and processing down mds aborts
Status:
Closed
% Done:
0%
Source:
Development
Backport:
jewel
Regression:
No
Severity:
2 - major
Component(FS):
MDS
Labels (FS):
multimds
Description
2016-10-22 19:31:56.346594 7f73a1b7c700  5 mds.ceph-mds3 handle_mds_map epoch 21 from mon.2
2016-10-22 19:31:56.346631 7f73a1b7c700 10 mds.ceph-mds3 my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=file layout v2}
2016-10-22 19:31:56.346637 7f73a1b7c700 10 mds.ceph-mds3 mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
2016-10-22 19:31:56.346640 7f73a1b7c700 10 mds.ceph-mds3 map says i am 192.168.180.3:6800/1449241693 mds.-1.0 state up:standby
2016-10-22 19:31:56.346645 7f73a1b7c700 10 mds.ceph-mds3 handle_mds_map: handling map in rankless mode
2016-10-22 19:31:56.346656 7f73a1b7c700 10 mds.beacon.ceph-mds3 set_want_state: up:boot -> up:standby
2016-10-22 19:31:56.346658 7f73a1b7c700  1 mds.ceph-mds3 handle_mds_map standby
2016-10-22 19:31:56.346673 7f73a1b7c700 10 mds.beacon.ceph-mds3 handle_mds_beacon up:boot seq 2 rtt 0.721590
2016-10-22 19:31:56.351818 7f73a1b7c700  5 mds.ceph-mds3 handle_mds_map epoch 22 from mon.2
2016-10-22 19:31:56.351865 7f73a1b7c700 10 mds.ceph-mds3 my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=file layout v2}
2016-10-22 19:31:56.351870 7f73a1b7c700 10 mds.ceph-mds3 mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
2016-10-22 19:31:56.351874 7f73a1b7c700 10 mds.ceph-mds3 map says i am 192.168.180.3:6800/1449241693 mds.1.22 state up:replay
2016-10-22 19:31:56.352072 7f73a1b7c700 10 mds.ceph-mds3 handle_mds_map: initializing MDS rank 1
2016-10-22 19:31:56.352287 7f73a1b7c700 10 mds.1.0 update_log_config log_to_monitors {default=true}
2016-10-22 19:31:56.352289 7f73a1b7c700 10 mds.1.0 create_logger
2016-10-22 19:31:56.352424 7f73a1b7c700  7 mds.1.server operator(): full = 0 epoch = 0
2016-10-22 19:31:56.352444 7f73a1b7c700  4 mds.1.cache.strays operator() data pool 1 not found in OSDMap
2016-10-22 19:31:56.352491 7f73a1b7c700 10 mds.ceph-mds3 handle_mds_map: handling map as rank 1
2016-10-22 19:31:56.352494 7f73a1b7c700  1 mds.1.22 handle_mds_map i am now mds.1.22
2016-10-22 19:31:56.352495 7f73a1b7c700  1 mds.1.22 handle_mds_map state change up:boot --> up:replay
2016-10-22 19:31:56.352518 7f73a1b7c700 10 mds.beacon.ceph-mds3 set_want_state: up:standby -> up:replay
2016-10-22 19:31:56.352520 7f73a1b7c700  1 mds.1.22 replay_start
2016-10-22 19:31:56.352524 7f73a1b7c700  7 mds.1.cache set_recovery_set 0,2,3,4,5,6,7,8
2016-10-22 19:31:56.352526 7f73a1b7c700  1 mds.1.22 recovery set is 0,2,3,4,5,6,7,8
2016-10-22 19:31:56.352531 7f73a1b7c700  1 mds.1.22 waiting for osdmap 63 (which blacklists prior instance)
2016-10-22 19:31:56.358505 7f73a1b7c700 -1 /srv/autobuild-ceph/gitbuilder.git/build/rpmbuild/BUILD/ceph-11.0.2/src/mds/MDSMap.h: In function 'const entity_inst_t MDSMap::get_inst(mds_rank_t)' thread 7f73a1b7c700 time 2016-10-22 19:31:56.352539

#0  0x00007f73a8aa200b in raise () from /lib64/libpthread.so.0
#1  0x00005558f96398a5 in reraise_fatal (signum=6) at /usr/src/debug/ceph-11.0.2/src/global/signal_handler.cc:72
#2  handle_fatal_signal (signum=6) at /usr/src/debug/ceph-11.0.2/src/global/signal_handler.cc:134
#3  <signal handler called>
#4  0x00007f73a7ac65c9 in raise () from /lib64/libc.so.6
#5  0x00007f73a7ac7cd8 in abort () from /lib64/libc.so.6
#6  0x00005558f96c9c57 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x5558f991e945 "up.count(m)", file=file@entry=0x5558f991cd58 "/srv/autobuild-ceph/gitbuilder.git/build/rpmbuild/BUILD/ceph-11.0.2/src/mds/MDSMap.h", line=line@entry=584, func=func@entry=0x5558f99217e0 <MDSMap::get_inst(int)::__PRETTY_FUNCTION__> "const entity_inst_t MDSMap::get_inst(mds_rank_t)") at /usr/src/debug/ceph-11.0.2/src/common/assert.cc:78
#7  0x00005558f939e04f in MDSMap::get_inst (this=0x5559044e0000, m=2) at /usr/src/debug/ceph-11.0.2/src/mds/MDSMap.h:584
#8  0x00005558f9389fe9 in MDSRankDispatcher::handle_mds_map (this=0x5559043d6600, m=m@entry=0x555904373b00, oldmap=oldmap@entry=0x5559044e0000) at /usr/src/debug/ceph-11.0.2/src/mds/MDSRank.cc:1552
#9  0x00005558f936da58 in MDSDaemon::handle_mds_map (this=this@entry=0x5559044dc000, m=m@entry=0x555904373b00) at /usr/src/debug/ceph-11.0.2/src/mds/MDSDaemon.cc:1013
#10 0x00005558f936ef13 in MDSDaemon::handle_core_message (this=this@entry=0x5559044dc000, m=m@entry=0x555904373b00) at /usr/src/debug/ceph-11.0.2/src/mds/MDSDaemon.cc:1211
#11 0x00005558f936f1ab in MDSDaemon::ms_dispatch (this=0x5559044dc000, m=0x555904373b00) at /usr/src/debug/ceph-11.0.2/src/mds/MDSDaemon.cc:1166
#12 0x00005558f989f8da in ms_deliver_dispatch (m=0x555904373b00, this=0x5559043e4000) at /usr/src/debug/ceph-11.0.2/src/msg/Messenger.h:593
#13 DispatchQueue::entry (this=0x5559043e4150) at /usr/src/debug/ceph-11.0.2/src/msg/DispatchQueue.cc:197
#14 0x00005558f9740acd in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /usr/src/debug/ceph-11.0.2/src/msg/DispatchQueue.h:103
#15 0x00007f73a8a9adf3 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f73a7b8701d in clone () from /lib64/libc.so.6
The MDS that failed was mds.4:
2016-10-22 19:31:56.311821 7f0b1b0a3700 -1 received signal: Terminated from PID: 1 task name: /usr/lib/systemd/systemd --system --deserialize 21 UID: 0
2016-10-22 19:31:56.311850 7f0b1b0a3700 -1 mds.ceph-mds4 *** got signal Terminated ***
2016-10-22 19:31:56.311853 7f0b1b0a3700  1 mds.ceph-mds4 suicide. wanted state up:active
2016-10-22 19:31:56.311938 7f0b1b0a3700 10 mds.beacon.ceph-mds4 set_want_state: up:active -> down:dne
2016-10-22 19:31:56.311950 7f0b1b0a3700 10 mds.beacon.ceph-mds4 _send down:dne seq 851
2016-10-22 19:31:56.312022 7f0b1b0a3700 20 mds.beacon.ceph-mds4 send_and_wait: awaiting 851 for up to 1s
2016-10-22 19:31:56.315008 7f0b1e8aa700 10 mds.beacon.ceph-mds4 handle_mds_beacon down:dne seq 851 rtt 0.003053
2016-10-22 19:31:56.315228 7f0b1b0a3700  1 mds.4.9 shutdown: shutting down rank 4
2016-10-22 19:31:56.315311 7f0b1b0a3700  5 mds.4.log shutdown
The bug is here:
https://github.com/ceph/ceph/blob/3eec78e5f104af71139f2c44a7b2432484cc48d4/src/mds/MDSRank.cc#L1552
I believe the fix is to check oldmap->have_inst(*p) in the prior if statement, so we only look up the instance when this MDS actually has state with the down MDS. I have a PR incoming for that.
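For context, the crash is the "up.count(m)" assert in MDSMap::get_inst() firing when handle_mds_map asks the old map about a rank that map never had up. Below is a minimal, self-contained sketch of the failure mode and the proposed have_inst() guard; the types and the should_handle_down() helper are simplified stand-ins for illustration, not the real Ceph definitions:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Illustrative stand-ins; names mirror MDSMap.h but the bodies are simplified.
using mds_rank_t = int32_t;
struct entity_inst_t { std::string addr; };

struct MDSMap {
  std::map<mds_rank_t, uint64_t> up;           // rank -> gid of up daemons
  std::map<uint64_t, entity_inst_t> inst_by_gid;

  // get_inst() asserts the rank is in `up` -- this is the assert that
  // fired at MDSMap.h:584 ("up.count(m)") when oldmap lacked the rank.
  entity_inst_t get_inst(mds_rank_t m) {
    assert(up.count(m));
    return inst_by_gid[up[m]];
  }

  // have_inst() is the safe membership test to use as a guard.
  bool have_inst(mds_rank_t m) const { return up.count(m) > 0; }
};

// Hypothetical helper sketching the proposed guard in
// MDSRankDispatcher::handle_mds_map: only consult the prior instance of a
// down rank if the old map actually had that rank up.
bool should_handle_down(const MDSMap& oldmap, mds_rank_t p) {
  return oldmap.have_inst(p);  // guard before calling oldmap.get_inst(p)
}
```

With this guard, a rank that was still resolving in the old map (as here, where the replaying mds.1 processes mds.4 going down) is simply skipped instead of tripping the assert.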
I think this one is kinda tough to write a test case for. Any suggestions?
Updated by Patrick Donnelly over 7 years ago
- Status changed from New to Fix Under Review
Updated by Patrick Donnelly over 7 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to jewel
Updated by Loïc Dachary over 7 years ago
- Copied to Backport #17706: jewel: multimds: mds entering up:replay and processing down mds aborts added
Updated by Patrick Donnelly over 7 years ago
- Status changed from Pending Backport to Resolved
Updated by Patrick Donnelly about 5 years ago
- Category deleted (90)
- Labels (FS) multimds added