
Bug #17670

multimds: mds entering up:replay and processing down mds aborts

Added by Patrick Donnelly 12 months ago. Updated 9 months ago.

Status:
Resolved
Priority:
Urgent
Category:
multi-MDS
Target version:
Start date:
10/22/2016
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
jewel
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Component(FS):
MDS
Needs Doc:
No

Description

2016-10-22 19:31:56.346594 7f73a1b7c700  5 mds.ceph-mds3 handle_mds_map epoch 21 from mon.2
2016-10-22 19:31:56.346631 7f73a1b7c700 10 mds.ceph-mds3      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=file layout v2}
2016-10-22 19:31:56.346637 7f73a1b7c700 10 mds.ceph-mds3  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
2016-10-22 19:31:56.346640 7f73a1b7c700 10 mds.ceph-mds3 map says i am 192.168.180.3:6800/1449241693 mds.-1.0 state up:standby
2016-10-22 19:31:56.346645 7f73a1b7c700 10 mds.ceph-mds3 handle_mds_map: handling map in rankless mode
2016-10-22 19:31:56.346656 7f73a1b7c700 10 mds.beacon.ceph-mds3 set_want_state: up:boot -> up:standby
2016-10-22 19:31:56.346658 7f73a1b7c700  1 mds.ceph-mds3 handle_mds_map standby
2016-10-22 19:31:56.346673 7f73a1b7c700 10 mds.beacon.ceph-mds3 handle_mds_beacon up:boot seq 2 rtt 0.721590
2016-10-22 19:31:56.351818 7f73a1b7c700  5 mds.ceph-mds3 handle_mds_map epoch 22 from mon.2
2016-10-22 19:31:56.351865 7f73a1b7c700 10 mds.ceph-mds3      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=file layout v2}
2016-10-22 19:31:56.351870 7f73a1b7c700 10 mds.ceph-mds3  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
2016-10-22 19:31:56.351874 7f73a1b7c700 10 mds.ceph-mds3 map says i am 192.168.180.3:6800/1449241693 mds.1.22 state up:replay
2016-10-22 19:31:56.352072 7f73a1b7c700 10 mds.ceph-mds3 handle_mds_map: initializing MDS rank 1
2016-10-22 19:31:56.352287 7f73a1b7c700 10 mds.1.0 update_log_config log_to_monitors {default=true}
2016-10-22 19:31:56.352289 7f73a1b7c700 10 mds.1.0 create_logger
2016-10-22 19:31:56.352424 7f73a1b7c700  7 mds.1.server operator(): full = 0 epoch = 0
2016-10-22 19:31:56.352444 7f73a1b7c700  4 mds.1.cache.strays operator() data pool 1 not found in OSDMap
2016-10-22 19:31:56.352491 7f73a1b7c700 10 mds.ceph-mds3 handle_mds_map: handling map as rank 1
2016-10-22 19:31:56.352494 7f73a1b7c700  1 mds.1.22 handle_mds_map i am now mds.1.22
2016-10-22 19:31:56.352495 7f73a1b7c700  1 mds.1.22 handle_mds_map state change up:boot --> up:replay
2016-10-22 19:31:56.352518 7f73a1b7c700 10 mds.beacon.ceph-mds3 set_want_state: up:standby -> up:replay
2016-10-22 19:31:56.352520 7f73a1b7c700  1 mds.1.22 replay_start
2016-10-22 19:31:56.352524 7f73a1b7c700  7 mds.1.cache set_recovery_set 0,2,3,4,5,6,7,8
2016-10-22 19:31:56.352526 7f73a1b7c700  1 mds.1.22  recovery set is 0,2,3,4,5,6,7,8
2016-10-22 19:31:56.352531 7f73a1b7c700  1 mds.1.22  waiting for osdmap 63 (which blacklists prior instance)
2016-10-22 19:31:56.358505 7f73a1b7c700 -1 /srv/autobuild-ceph/gitbuilder.git/build/rpmbuild/BUILD/ceph-11.0.2/src/mds/MDSMap.h: In function 'const entity_inst_t MDSMap::get_inst(mds_rank_t)' thread 7f73a1b7c700 time 2016-10-22 19:31:56.352539
#0  0x00007f73a8aa200b in raise () from /lib64/libpthread.so.0
#1  0x00005558f96398a5 in reraise_fatal (signum=6) at /usr/src/debug/ceph-11.0.2/src/global/signal_handler.cc:72
#2  handle_fatal_signal (signum=6) at /usr/src/debug/ceph-11.0.2/src/global/signal_handler.cc:134
#3  <signal handler called>
#4  0x00007f73a7ac65c9 in raise () from /lib64/libc.so.6
#5  0x00007f73a7ac7cd8 in abort () from /lib64/libc.so.6
#6  0x00005558f96c9c57 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x5558f991e945 "up.count(m)",
    file=file@entry=0x5558f991cd58 "/srv/autobuild-ceph/gitbuilder.git/build/rpmbuild/BUILD/ceph-11.0.2/src/mds/MDSMap.h", line=line@entry=584,
    func=func@entry=0x5558f99217e0 <MDSMap::get_inst(int)::__PRETTY_FUNCTION__> "const entity_inst_t MDSMap::get_inst(mds_rank_t)")
    at /usr/src/debug/ceph-11.0.2/src/common/assert.cc:78
#7  0x00005558f939e04f in MDSMap::get_inst (this=0x5559044e0000, m=2) at /usr/src/debug/ceph-11.0.2/src/mds/MDSMap.h:584
#8  0x00005558f9389fe9 in MDSRankDispatcher::handle_mds_map (this=0x5559043d6600, m=m@entry=0x555904373b00, oldmap=oldmap@entry=0x5559044e0000)
    at /usr/src/debug/ceph-11.0.2/src/mds/MDSRank.cc:1552
#9  0x00005558f936da58 in MDSDaemon::handle_mds_map (this=this@entry=0x5559044dc000, m=m@entry=0x555904373b00)
    at /usr/src/debug/ceph-11.0.2/src/mds/MDSDaemon.cc:1013
#10 0x00005558f936ef13 in MDSDaemon::handle_core_message (this=this@entry=0x5559044dc000, m=m@entry=0x555904373b00)
    at /usr/src/debug/ceph-11.0.2/src/mds/MDSDaemon.cc:1211
#11 0x00005558f936f1ab in MDSDaemon::ms_dispatch (this=0x5559044dc000, m=0x555904373b00) at /usr/src/debug/ceph-11.0.2/src/mds/MDSDaemon.cc:1166
#12 0x00005558f989f8da in ms_deliver_dispatch (m=0x555904373b00, this=0x5559043e4000) at /usr/src/debug/ceph-11.0.2/src/msg/Messenger.h:593
#13 DispatchQueue::entry (this=0x5559043e4150) at /usr/src/debug/ceph-11.0.2/src/msg/DispatchQueue.cc:197
#14 0x00005558f9740acd in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /usr/src/debug/ceph-11.0.2/src/msg/DispatchQueue.h:103
#15 0x00007f73a8a9adf3 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f73a7b8701d in clone () from /lib64/libc.so.6

The MDS that failed was mds.4:

2016-10-22 19:31:56.311821 7f0b1b0a3700 -1 received  signal: Terminated from  PID: 1 task name: /usr/lib/systemd/systemd --system --deserialize 21  UID: 0
2016-10-22 19:31:56.311850 7f0b1b0a3700 -1 mds.ceph-mds4 *** got signal Terminated ***
2016-10-22 19:31:56.311853 7f0b1b0a3700  1 mds.ceph-mds4 suicide.  wanted state up:active
2016-10-22 19:31:56.311938 7f0b1b0a3700 10 mds.beacon.ceph-mds4 set_want_state: up:active -> down:dne
2016-10-22 19:31:56.311950 7f0b1b0a3700 10 mds.beacon.ceph-mds4 _send down:dne seq 851
2016-10-22 19:31:56.312022 7f0b1b0a3700 20 mds.beacon.ceph-mds4 send_and_wait: awaiting 851 for up to 1s
2016-10-22 19:31:56.315008 7f0b1e8aa700 10 mds.beacon.ceph-mds4 handle_mds_beacon down:dne seq 851 rtt 0.003053
2016-10-22 19:31:56.315228 7f0b1b0a3700  1 mds.4.9 shutdown: shutting down rank 4
2016-10-22 19:31:56.315311 7f0b1b0a3700  5 mds.4.log shutdown

The bug is here:

https://github.com/ceph/ceph/blob/3eec78e5f104af71139f2c44a7b2432484cc48d4/src/mds/MDSRank.cc#L1552

I believe the fix is to add an

oldmap->have_inst(*p)
check to the prior if statement, so that we only look up the instance when the old map actually has state for the down MDS. I have a PR incoming for that.
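To illustrate the failure mode and the proposed guard, here is a minimal sketch. FakeMDSMap and handle_down_rank are hypothetical stand-ins for the real MDSMap/MDSRank code, not the actual implementation; the point is that get_inst() asserts up.count(m), so callers must check have_inst() first when the rank may be absent from the old map:

```cpp
#include <cassert>
#include <iostream>
#include <map>
#include <string>

// Hypothetical stand-in for MDSMap: only the two methods relevant here.
struct FakeMDSMap {
  std::map<int, std::string> up;  // rank -> instance address

  bool have_inst(int m) const { return up.count(m) > 0; }

  // Mirrors the real get_inst(): asserts if the rank is not up.
  // This assert ("up.count(m)") is what fired in the crash above.
  std::string get_inst(int m) const {
    assert(up.count(m));
    return up.at(m);
  }
};

// Sketch of the guarded handling: when processing a rank that went down,
// only look up its instance if the old map actually knew about it.
// Returns true if there was an instance to act on.
bool handle_down_rank(const FakeMDSMap& oldmap, int rank) {
  if (oldmap.have_inst(rank)) {
    std::cout << "marking down " << oldmap.get_inst(rank) << "\n";
    return true;
  }
  // A just-started replay daemon's old map may have no entry for this
  // rank at all; without the guard, get_inst() would abort here.
  std::cout << "rank " << rank << " unknown in old map; nothing to do\n";
  return false;
}
```

Without the have_inst() check, the second case reproduces the abort seen in the backtrace (MDSMap::get_inst called for a rank the old map never had up).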

I think this one is kinda tough to write a test case for. Any suggestions?


Related issues

Copied to fs - Backport #17706: jewel: multimds: mds entering up:replay and processing down mds aborts Resolved

History

#1 Updated by Patrick Donnelly 12 months ago

  • Status changed from New to Need Review

#2 Updated by Patrick Donnelly 12 months ago

  • Status changed from Need Review to Pending Backport
  • Backport set to jewel

#3 Updated by Loic Dachary 12 months ago

  • Copied to Backport #17706: jewel: multimds: mds entering up:replay and processing down mds aborts added

#4 Updated by John Spray 11 months ago

  • Target version set to v12.0.0

#5 Updated by Patrick Donnelly 9 months ago

  • Status changed from Pending Backport to Resolved
