Bug #35848
MDSMonitor: lookup of gid in prepare_beacon that has been removed will cause exception
Status: Resolved
Priority: Urgent
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source: other
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDSMonitor
Labels (FS): crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2018-09-07 06:28:53.829359 7fe856397700 1 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 fail_mds_gid 4864 mds.ceph-sshreeka-1536308179377-node6-mds role 0
2018-09-07 06:28:53.829589 7fe856397700 5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 prepare_beacon pending map now:
2018-09-07 06:28:53.829601 7fe856397700 5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 preprocess_beacon mdsbeacon(4885/ceph-sshreeka-1536308179377-node6-mds up:boot seq 15 v166) v7 from mds.? 172.16.115.21:6800/3695643259 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-09-07 06:28:53.829639 7fe856397700 5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 preprocess_beacon mdsbeacon(4885/ceph-sshreeka-1536308179377-node6-mds up:boot seq 16 v166) v7 from mds.? 172.16.115.21:6800/3695643259 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-09-07 06:28:53.829658 7fe856397700 5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 preprocess_beacon mdsbeacon(4885/ceph-sshreeka-1536308179377-node6-mds up:boot seq 17 v166) v7 from mds.? 172.16.115.21:6800/3695643259 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-09-07 06:28:53.829667 7fe856397700 5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 preprocess_beacon mdsbeacon(4885/ceph-sshreeka-1536308179377-node6-mds up:boot seq 2 v166) v7 from mds.? 172.16.115.21:6800/3695643259 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-09-07 06:28:53.829677 7fe856397700 5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 preprocess_beacon mdsbeacon(4885/ceph-sshreeka-1536308179377-node6-mds up:boot seq 3 v166) v7 from mds.? 172.16.115.21:6800/3695643259 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-09-07 06:28:53.829694 7fe856397700 5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 preprocess_beacon mdsbeacon(4885/ceph-sshreeka-1536308179377-node6-mds up:boot seq 4 v166) v7 from mds.? 172.16.115.21:6800/3695643259 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-09-07 06:28:53.832094 7fe856397700 4 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 filesystem_command prefix='mds fail'
2018-09-07 06:28:53.832105 7fe856397700 1 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 gid_from_arg: rank/GID 0 not a existent rank or GID
2018-09-07 06:28:53.832107 7fe856397700 4 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 prepare_command done, r=0
2018-09-07 06:28:53.832138 7fe856397700 5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 preprocess_beacon mdsbeacon(4864/ceph-sshreeka-1536308179377-node6-mds down:damaged seq 9 v166) v7 from mds.0 172.16.115.21:6800/2937610127 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-09-07 06:28:53.832158 7fe856397700 5 mon.ceph-sshreeka-1536308179377-node14-monmgr@1(leader).mds e166 _note_beacon mdsbeacon(4864/ceph-sshreeka-1536308179377-node6-mds down:damaged seq 9 v166) v7 noting time
2018-09-07 06:28:53.839276 7fe856397700 -1 *** Caught signal (Aborted) **
in thread 7fe856397700 thread_name:fn_monstore
ceph version 12.2.4-42.1.hotfix.nvidia.el7cp (4a72ecd06cdc5a049945b166073ce39fbe631308) luminous (stable)
1: (()+0x931071) [0x5611702d4071]
2: (()+0xf680) [0x7fe86390e680]
3: (gsignal()+0x37) [0x7fe860c49207]
4: (abort()+0x148) [0x7fe860c4a8f8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fe8615587d5]
6: (()+0x5e746) [0x7fe861556746]
7: (()+0x5e773) [0x7fe861556773]
8: (()+0x5e993) [0x7fe861556993]
9: (std::__throw_out_of_range(char const*)+0x77) [0x7fe8615ab857]
10: (FSMap::get_info_gid(mds_gid_t) const+0xfc) [0x56116ff5e1ac]
11: (MDSMonitor::prepare_beacon(boost::intrusive_ptr<MonOpRequest>)+0x77d) [0x56116ff5190d]
12: (MDSMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x257) [0x56116ff58d97]
13: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xaf8) [0x56116feb49d8]
14: (PaxosService::C_RetryMessage::_finish(int)+0x5e) [0x56116fdee3fe]
15: (Context::complete(int)+0x9) [0x56116fd9b7b9]
16: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x56116fda514c]
17: (Paxos::finish_round()+0x11e) [0x56116fea5a0e]
18: (Paxos::commit_finish()+0x71d) [0x56116fea6b0d]
19: (C_Committed::finish(int)+0x31) [0x56116feae961]
20: (Context::complete(int)+0x9) [0x56116fd9b7b9]
21: (MonitorDBStore::C_DoTransaction::finish(int)+0xa7) [0x56116feadb57]
22: (Context::complete(int)+0x9) [0x56116fd9b7b9]
23: (Finisher::finisher_thread_entry()+0x198) [0x56116ffd0558]
24: (()+0x7dd5) [0x7fe863906dd5]
25: (clone()+0x6d) [0x7fe860d11b3d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Problem introduced by this change: https://github.com/ceph/ceph/commit/624efc64323f99b2e843f376879c1080276e036f#diff-6c4f848e4bb0fe57e9c0f9bc67b14beaL354
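Since that change, beacons are no longer dropped when their gid has already been removed from pending_fsmap, so prepare_beacon ends up looking up a gid that no longer exists and FSMap::get_info_gid throws std::out_of_range. We need a new check in prepare_beacon that operates on pending_fsmap.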
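To illustrate the failure mode and the kind of check described above, here is a minimal, self-contained sketch. It does not use the Ceph sources: `PendingMap`, `Gid`, `gid_exists`, `get_info_unchecked`, and `handle_beacon` are hypothetical stand-ins, and the actual fix in MDSMonitor may look different. It only shows that `std::map::at` throws `std::out_of_range` for a missing key, and that testing for the gid first lets the beacon be dropped instead.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>  // std::out_of_range, thrown by std::map::at
#include <string>

// Hypothetical stand-ins for illustration only; these are NOT the Ceph types.
using Gid = std::uint64_t;

struct PendingMap {
  std::map<Gid, std::string> info_by_gid;  // gid -> daemon name

  // Mirrors the failing pattern from the backtrace: std::map::at throws
  // std::out_of_range when the gid is no longer in the map.
  const std::string& get_info_unchecked(Gid gid) const {
    return info_by_gid.at(gid);
  }

  // Cheap existence test to run before any lookup.
  bool gid_exists(Gid gid) const {
    return info_by_gid.count(gid) > 0;
  }
};

// Returns true if the beacon was processed, false if it was dropped
// because its gid has already been removed from the pending map.
bool handle_beacon(const PendingMap& pending, Gid gid) {
  if (!pending.gid_exists(gid)) {
    return false;  // drop the beacon instead of throwing
  }
  const std::string& name = pending.get_info_unchecked(gid);  // safe here
  return !name.empty();
}

// Small usage check with the two gids that appear in the monitor log.
inline void demo() {
  PendingMap pending;
  pending.info_by_gid[4885] = "node6-mds";  // the newly booted daemon
  // gid 4864 was failed, so it is intentionally absent from the map.
  assert(handle_beacon(pending, 4885));
  assert(!handle_beacon(pending, 4864));
}
```

Calling `get_info_unchecked(4864)` directly on the same map would throw, which corresponds to the uncaught exception and abort seen in the monitor log.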
Updated by Patrick Donnelly over 5 years ago
- Status changed from New to 12
(gdb) bt
#0 0x00007fffed3e9428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007fffed3eb02a in __GI_abort () at abort.c:89
#2 0x00007fffedd300d5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007fffedd2dcc6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007fffedd2dd11 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007fffedd2df54 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007fffedd579af in std::__throw_out_of_range(char const*) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x0000000100487c00 in std::map<mds_gid_t, int, std::less<mds_gid_t>, std::allocator<std::pair<mds_gid_t const, int> > >::at (__k=..., this=<optimized out>) at /usr/include/c++/7/bits/stl_map.h:533
#8 FSMap::get_info_gid (this=this@entry=0x101684dc8, gid=...) at /home/pdonnell/ceph/src/mds/FSMap.h:357
#9 0x000000010047cbd6 in MDSMonitor::prepare_beacon (this=this@entry=0x101684b00, op=...) at /home/pdonnell/ceph/src/mon/MDSMonitor.cc:647
#10 0x000000010047f2d0 in MDSMonitor::prepare_update (this=0x101684b00, op=...) at /home/pdonnell/ceph/src/mon/MDSMonitor.cc:506
#11 0x00000001003dd6ee in PaxosService::dispatch (this=0x101684b00, op=...) at /home/pdonnell/ceph/src/mon/PaxosService.cc:91
#12 0x00000001002a916e in Monitor::dispatch_op (this=this@entry=0x101bd1800, op=...) at /home/pdonnell/ceph/src/mon/Monitor.cc:4177
#13 0x00000001002aa8b3 in Monitor::_ms_dispatch (this=this@entry=0x101bd1800, m=m@entry=0x101ef0700) at /home/pdonnell/ceph/src/mon/Monitor.cc:4097
#14 0x00000001002d3ed3 in Monitor::ms_dispatch (this=0x101bd1800, m=0x101ef0700) at /home/pdonnell/ceph/src/mon/Monitor.h:878
#15 0x00000001002b03f6 in Dispatcher::ms_dispatch2 (this=0x101bd1800, m=...) at /home/pdonnell/ceph/src/msg/Dispatcher.h:125
#16 0x00007fffef572a5a in Messenger::ms_deliver_dispatch (m=..., this=0x101689800) at /home/pdonnell/ceph/src/msg/Messenger.h:642
#17 DispatchQueue::entry (this=0x101689a10) at /home/pdonnell/ceph/src/msg/DispatchQueue.cc:196
#18 0x00007fffef60ac5d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /home/pdonnell/ceph/src/msg/DispatchQueue.h:102
#19 0x00007fffee2316ba in start_thread (arg=0x7fffe36a8700) at pthread_create.c:333
#20 0x00007fffed4bb41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Reproduced on master. It's sufficient to do:
$ while sleep 0.5; do bin/ceph mds fail 0; done
with a vstart cluster and 1 MDS.
Updated by Patrick Donnelly over 5 years ago
- Related to Bug #35850: mds: runs out of file descriptors after several respawns added
Updated by Patrick Donnelly over 5 years ago
- Status changed from 12 to Fix Under Review
Updated by Patrick Donnelly over 5 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Patrick Donnelly over 5 years ago
- Copied to Backport #35858: mimic: MDSMonitor: lookup of gid in prepare_beacon that has been removed will cause exception added
Updated by Patrick Donnelly over 5 years ago
- Copied to Backport #35859: luminous: MDSMonitor: lookup of gid in prepare_beacon that has been removed will cause exception added
Updated by Nathan Cutler over 5 years ago
- Status changed from Pending Backport to Resolved