Bug #35848

MDSMonitor: lookup of gid in prepare_beacon that has been removed will cause exception

Added by Patrick Donnelly 3 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Urgent
Category: Correctness/Safety
Target version:
Start date: 09/07/2018
Due date:
% Done: 0%
Source: other
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDSMonitor
Labels (FS): crash
Pull request ID:

Description

2018-09-07 06:28:53.829359 7fe856397700  1 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 fail_mds_gid 4864 mds.ceph-sshreeka-1536308179377-node6-mds role 0
2018-09-07 06:28:53.829589 7fe856397700  5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 prepare_beacon pending map now:
2018-09-07 06:28:53.829601 7fe856397700  5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 preprocess_beacon mdsbeacon(4885/ceph-sshreeka-1536308179377-node6-mds up:boot seq 15 v166) v7 from mds.? 172.16.115.21:6800/3695643259 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-09-07 06:28:53.829639 7fe856397700  5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 preprocess_beacon mdsbeacon(4885/ceph-sshreeka-1536308179377-node6-mds up:boot seq 16 v166) v7 from mds.? 172.16.115.21:6800/3695643259 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-09-07 06:28:53.829658 7fe856397700  5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 preprocess_beacon mdsbeacon(4885/ceph-sshreeka-1536308179377-node6-mds up:boot seq 17 v166) v7 from mds.? 172.16.115.21:6800/3695643259 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-09-07 06:28:53.829667 7fe856397700  5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 preprocess_beacon mdsbeacon(4885/ceph-sshreeka-1536308179377-node6-mds up:boot seq 2 v166) v7 from mds.? 172.16.115.21:6800/3695643259 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-09-07 06:28:53.829677 7fe856397700  5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 preprocess_beacon mdsbeacon(4885/ceph-sshreeka-1536308179377-node6-mds up:boot seq 3 v166) v7 from mds.? 172.16.115.21:6800/3695643259 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-09-07 06:28:53.829694 7fe856397700  5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 preprocess_beacon mdsbeacon(4885/ceph-sshreeka-1536308179377-node6-mds up:boot seq 4 v166) v7 from mds.? 172.16.115.21:6800/3695643259 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-09-07 06:28:53.832094 7fe856397700  4 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 filesystem_command prefix='mds fail'
2018-09-07 06:28:53.832105 7fe856397700  1 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 gid_from_arg: rank/GID 0 not a existent rank or GID
2018-09-07 06:28:53.832107 7fe856397700  4 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 prepare_command done, r=0
2018-09-07 06:28:53.832138 7fe856397700  5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 preprocess_beacon mdsbeacon(4864/ceph-sshreeka-1536308179377-node6-mds down:damaged seq 9 v166) v7 from mds.0 172.16.115.21:6800/2937610127 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-09-07 06:28:53.832158 7fe856397700  5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 _note_beacon mdsbeacon(4864/ceph-sshreeka-1536308179377-node6-mds down:damaged seq 9 v166) v7 noting time
2018-09-07 06:28:53.839276 7fe856397700 -1 *** Caught signal (Aborted) **
 in thread 7fe856397700 thread_name:fn_monstore

 ceph version 12.2.4-42.1.hotfix.nvidia.el7cp (4a72ecd06cdc5a049945b166073ce39fbe631308) luminous (stable)
 1: (()+0x931071) [0x5611702d4071]
 2: (()+0xf680) [0x7fe86390e680]
 3: (gsignal()+0x37) [0x7fe860c49207]
 4: (abort()+0x148) [0x7fe860c4a8f8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fe8615587d5]
 6: (()+0x5e746) [0x7fe861556746]
 7: (()+0x5e773) [0x7fe861556773]
 8: (()+0x5e993) [0x7fe861556993]
 9: (std::__throw_out_of_range(char const*)+0x77) [0x7fe8615ab857]
 10: (FSMap::get_info_gid(mds_gid_t) const+0xfc) [0x56116ff5e1ac]
 11: (MDSMonitor::prepare_beacon(boost::intrusive_ptr<MonOpRequest>)+0x77d) [0x56116ff5190d]
 12: (MDSMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x257) [0x56116ff58d97]
 13: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xaf8) [0x56116feb49d8]
 14: (PaxosService::C_RetryMessage::_finish(int)+0x5e) [0x56116fdee3fe]
 15: (Context::complete(int)+0x9) [0x56116fd9b7b9]
 16: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x56116fda514c]
 17: (Paxos::finish_round()+0x11e) [0x56116fea5a0e]
 18: (Paxos::commit_finish()+0x71d) [0x56116fea6b0d]
 19: (C_Committed::finish(int)+0x31) [0x56116feae961]
 20: (Context::complete(int)+0x9) [0x56116fd9b7b9]
 21: (MonitorDBStore::C_DoTransaction::finish(int)+0xa7) [0x56116feadb57]
 22: (Context::complete(int)+0x9) [0x56116fd9b7b9]
 23: (Finisher::finisher_thread_entry()+0x198) [0x56116ffd0558]
 24: (()+0x7dd5) [0x7fe863906dd5]
 25: (clone()+0x6d) [0x7fe860d11b3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Problem introduced by this change: https://github.com/ceph/ceph/commit/624efc64323f99b2e843f376879c1080276e036f#diff-6c4f848e4bb0fe57e9c0f9bc67b14beaL354

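Since that change, beacons are no longer dropped when the sender's gid has already been removed from pending_fsmap. prepare_beacon then calls FSMap::get_info_gid on a gid that no longer exists, and the underlying std::map::at throws std::out_of_range, which aborts the monitor. prepare_beacon needs a new check that operates on pending_fsmap.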
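
To illustrate the kind of check described above, here is a minimal sketch. It is not the actual Ceph code or the merged fix: FsMapSketch, MdsInfo, BeaconResult and both prepare_beacon_* functions are hypothetical stand-ins. Only the use of std::map::at inside get_info_gid is taken from the backtrace in this report.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for illustration only; not the real Ceph types.
using mds_gid_t = std::uint64_t;

struct MdsInfo {
  std::string name;
  int rank;
};

struct FsMapSketch {
  std::map<mds_gid_t, MdsInfo> infos;

  // True if the gid is still present in this (pending) map.
  bool gid_exists(mds_gid_t gid) const { return infos.count(gid) > 0; }

  // Mirrors the lookup seen in the backtrace: std::map::at throws
  // std::out_of_range when the key is missing.
  const MdsInfo& get_info_gid(mds_gid_t gid) const { return infos.at(gid); }
};

enum class BeaconResult { Dropped, Handled };

// Unguarded path: looks the gid up without checking, so a beacon from a
// gid that was just removed makes get_info_gid throw.
BeaconResult prepare_beacon_unguarded(const FsMapSketch& pending, mds_gid_t gid) {
  const MdsInfo& info = pending.get_info_gid(gid);  // may throw
  (void)info;
  return BeaconResult::Handled;
}

// Guarded path: check the pending map first and drop the beacon if the
// gid is gone, so the lookup below can no longer throw.
BeaconResult prepare_beacon_guarded(const FsMapSketch& pending, mds_gid_t gid) {
  if (!pending.gid_exists(gid)) {
    return BeaconResult::Dropped;
  }
  assert(pending.gid_exists(gid));  // holds after the guard above
  const MdsInfo& info = pending.get_info_gid(gid);
  (void)info;
  return BeaconResult::Handled;
}
```

In the log above, gid 4864 was failed just before its down:damaged beacon was processed. With a map that no longer contains that gid, the unguarded variant throws std::out_of_range, while the guarded variant returns BeaconResult::Dropped.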


Related issues

Related to fs - Bug #35850: mds: runs out of file descriptors after several respawns (Pending Backport, 09/07/2018)
Copied to fs - Backport #35858: mimic: MDSMonitor: lookup of gid in prepare_beacon that has been removed will cause exception (Resolved)
Copied to fs - Backport #35859: luminous: MDSMonitor: lookup of gid in prepare_beacon that has been removed will cause exception (Resolved)

History

#1 Updated by Patrick Donnelly 3 months ago

  • Status changed from New to Verified

(gdb) bt
#0  0x00007fffed3e9428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007fffed3eb02a in __GI_abort () at abort.c:89
#2  0x00007fffedd300d5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fffedd2dcc6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fffedd2dd11 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fffedd2df54 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fffedd579af in std::__throw_out_of_range(char const*) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x0000000100487c00 in std::map<mds_gid_t, int, std::less<mds_gid_t>, std::allocator<std::pair<mds_gid_t const, int> > >::at (__k=..., this=<optimized out>) at /usr/include/c++/7/bits/stl_map.h:533
#8  FSMap::get_info_gid (this=this@entry=0x101684dc8, gid=...) at /home/pdonnell/ceph/src/mds/FSMap.h:357
#9  0x000000010047cbd6 in MDSMonitor::prepare_beacon (this=this@entry=0x101684b00, op=...) at /home/pdonnell/ceph/src/mon/MDSMonitor.cc:647
#10 0x000000010047f2d0 in MDSMonitor::prepare_update (this=0x101684b00, op=...) at /home/pdonnell/ceph/src/mon/MDSMonitor.cc:506
#11 0x00000001003dd6ee in PaxosService::dispatch (this=0x101684b00, op=...) at /home/pdonnell/ceph/src/mon/PaxosService.cc:91
#12 0x00000001002a916e in Monitor::dispatch_op (this=this@entry=0x101bd1800, op=...) at /home/pdonnell/ceph/src/mon/Monitor.cc:4177
#13 0x00000001002aa8b3 in Monitor::_ms_dispatch (this=this@entry=0x101bd1800, m=m@entry=0x101ef0700) at /home/pdonnell/ceph/src/mon/Monitor.cc:4097
#14 0x00000001002d3ed3 in Monitor::ms_dispatch (this=0x101bd1800, m=0x101ef0700) at /home/pdonnell/ceph/src/mon/Monitor.h:878
#15 0x00000001002b03f6 in Dispatcher::ms_dispatch2 (this=0x101bd1800, m=...) at /home/pdonnell/ceph/src/msg/Dispatcher.h:125
#16 0x00007fffef572a5a in Messenger::ms_deliver_dispatch (m=..., this=0x101689800) at /home/pdonnell/ceph/src/msg/Messenger.h:642
#17 DispatchQueue::entry (this=0x101689a10) at /home/pdonnell/ceph/src/msg/DispatchQueue.cc:196
#18 0x00007fffef60ac5d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /home/pdonnell/ceph/src/msg/DispatchQueue.h:102
#19 0x00007fffee2316ba in start_thread (arg=0x7fffe36a8700) at pthread_create.c:333
#20 0x00007fffed4bb41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Reproduced on master. It's sufficient to do:

$ while sleep 0.5; do bin/ceph mds fail 0; done

with a vstart cluster and 1 MDS.

#2 Updated by Patrick Donnelly 3 months ago

  • Related to Bug #35850: mds: runs out of file descriptors after several respawns added

#3 Updated by Patrick Donnelly 3 months ago

  • Status changed from Verified to Need Review

#4 Updated by Patrick Donnelly 3 months ago

  • Status changed from Need Review to Pending Backport

#5 Updated by Patrick Donnelly 3 months ago

  • Copied to Backport #35858: mimic: MDSMonitor: lookup of gid in prepare_beacon that has been removed will cause exception added

#6 Updated by Patrick Donnelly 3 months ago

  • Copied to Backport #35859: luminous: MDSMonitor: lookup of gid in prepare_beacon that has been removed will cause exception added

#7 Updated by Nathan Cutler about 2 months ago

  • Status changed from Pending Backport to Resolved
