Feature #12193
OSD's are not updating osdmap properly after monitoring crash
0%
Description
I recently had an issue where the whole cluster went down because of a monitor crash caused by a faulty crushmap injection; see #12047 for reference.
2015-06-30 13:43:00.878364 7f9feb23d840 0 osd.5 1547 load_pgs
2015-06-30 13:43:01.076863 7f9feb23d840 0 osd.5 1547 load_pgs opened 43 pgs
2015-06-30 13:43:01.077486 7f9feb23d840 -1 osd.5 1547 log_to_monitors {default=true}
2015-06-30 13:43:01.086179 7f9fd76bd700 0 osd.5 1547 ignoring osdmap until we have initialized
2015-06-30 13:43:01.089618 7f9fd76bd700 0 osd.5 1547 ignoring osdmap until we have initialized
2015-06-30 13:43:02.086065 7f9fccea8700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f9fccea8700
 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: /usr/bin/ceph-osd() [0xbef08c]
 2: (()+0xf0a0) [0x7f9fea1500a0]
 3: /usr/bin/ceph-osd() [0xd5c934]
 4: (crush_do_rule()+0x390) [0xd5d570]
 5: (CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const+0x8b) [0xcb146b]
 6: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*, int*, unsigned int*) const+0x7c) [0xc9a68c]
 7: (OSDMap::_pg_to_up_acting_osds(pg_t const&, std::vector<int, std::allocator<int> >*, int*, std::vector<int, std::allocator<int> >*, int*) const+0x10f) [0xc9a82f]
 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x1e2) [0x7d1dc2]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x262) [0x7d2cd2]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x14) [0x8385c4]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xccce79]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xccee70]
 13: (()+0x6b50) [0x7f9fea147b50]
 14: (clone()+0x6d) [0x7f9fe8b6395d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
 -31> 2015-06-30 13:43:00.127913 7f9feb23d840 5 asok(0x50be000) register_command perfcounters_dump hook 0x5082050
 -30> 2015-06-30 13:43:00.127998 7f9feb23d840 5 asok(0x50be000) register_command 1 hook 0x5082050
 -29> 2015-06-30 13:43:00.127998 7f9feb23d840 5 asok(0x50be000) register_command perf dump hook 0x5082050
 -28> 2015-06-30 13:43:00.127998 7f9feb23d840 5 asok(0x50be000) register_command perfcounters_schema hook 0x5082050
 -27> 2015-06-30 13:43:00.127998 7f9feb23d840 5 asok(0x50be000) register_command 2 hook 0x5082050
 -26> 2015-06-30 13:43:00.127998 7f9feb23d840 5 asok(0x50be000) register_command perf schema hook 0x5082050
 -25> 2015-06-30 13:43:00.127998 7f9feb23d840 5 asok(0x50be000) register_command perf reset hook 0x5082050
 -24> 2015-06-30 13:43:00.127998 7f9feb23d840 5 asok(0x50be000) register_command config show hook 0x5082050
 -23> 2015-06-30 13:43:00.127998 7f9feb23d840 5 asok(0x50be000) register_command config set hook 0x5082050
 -22> 2015-06-30 13:43:00.128058 7f9feb23d840 5 asok(0x50be000) register_command config get hook 0x5082050
 -21> 2015-06-30 13:43:00.128060 7f9feb23d840 5 asok(0x50be000) register_command config diff hook 0x5082050
 -20> 2015-06-30 13:43:00.128067 7f9feb23d840 5 asok(0x50be000) register_command log flush hook 0x5082050
 -19> 2015-06-30 13:43:00.128067 7f9feb23d840 5 asok(0x50be000) register_command log dump hook 0x5082050
 -18> 2015-06-30 13:43:00.128085 7f9feb23d840 5 asok(0x50be000) register_command log reopen hook 0x5082050
 -17> 2015-06-30 13:43:00.130445 7f9feb23d840 0 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3), process ceph-osd, pid 2947
 -16> 2015-06-30 13:43:00.132297 7f9feb23d840 1 finished global_init_daemonize
 -15> 2015-06-30 13:43:00.152866 7f9feb23d840 0 filestore(/var/lib/ceph/osd/ceph-5) backend xfs (magic 0x58465342)
 -14> 2015-06-30 13:43:00.244827 7f9feb23d840 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: FIEMAP ioctl is supported and appears to work
 -13> 2015-06-30 13:43:00.244853 7f9feb23d840 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
 -12> 2015-06-30 13:43:00.604742 7f9feb23d840 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: syscall(SYS_syncfs, fd) fully supported
 -11> 2015-06-30 13:43:00.604855 7f9feb23d840 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: disabling extsize, kernel 3.2.0-4-amd64 is older than 3.5 and has buggy extsize ioctl
 -10> 2015-06-30 13:43:00.714798 7f9feb23d840 0 filestore(/var/lib/ceph/osd/ceph-5) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
 -9> 2015-06-30 13:43:00.869904 7f9feb23d840 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
 -8> 2015-06-30 13:43:00.877680 7f9feb23d840 0 osd.5 1547 crush map has features 1107558400, adjusting msgr requires for clients
 -7> 2015-06-30 13:43:00.878332 7f9feb23d840 0 osd.5 1547 crush map has features 1107558400 was 8705, adjusting msgr requires for mons
 -6> 2015-06-30 13:43:00.878332 7f9feb23d840 0 osd.5 1547 crush map has features 1107558400, adjusting msgr requires for osds
 -5> 2015-06-30 13:43:00.878364 7f9feb23d840 0 osd.5 1547 load_pgs
 -4> 2015-06-30 13:43:01.076863 7f9feb23d840 0 osd.5 1547 load_pgs opened 43 pgs
 -3> 2015-06-30 13:43:01.077486 7f9feb23d840 -1 osd.5 1547 log_to_monitors {default=true}
 -2> 2015-06-30 13:43:01.086179 7f9fd76bd700 0 osd.5 1547 ignoring osdmap until we have initialized
 -1> 2015-06-30 13:43:01.089618 7f9fd76bd700 0 osd.5 1547 ignoring osdmap until we have initialized
 0> 2015-06-30 13:43:02.086065 7f9fccea8700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f9fccea8700
 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: /usr/bin/ceph-osd() [0xbef08c]
 2: (()+0xf0a0) [0x7f9fea1500a0]
 3: /usr/bin/ceph-osd() [0xd5c934]
 4: (crush_do_rule()+0x390) [0xd5d570]
 5: (CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const+0x8b) [0xcb146b]
 6: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*, int*, unsigned int*) const+0x7c) [0xc9a68c]
 7: (OSDMap::_pg_to_up_acting_osds(pg_t const&, std::vector<int, std::allocator<int> >*, int*, std::vector<int, std::allocator<int> >*, int*) const+0x10f) [0xc9a82f]
 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x1e2) [0x7d1dc2]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x262) [0x7d2cd2]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x14) [0x8385c4]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xccce79]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xccee70]
 13: (()+0x6b50) [0x7f9fea147b50]
 14: (clone()+0x6d) [0x7f9fe8b6395d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
If you need any further debugging information that is not already in the referenced ticket, let me know.
Related issues
Associated revisions
tools/ceph-objectstore-tool: add "set-osdmap" command
Fixes: #12193
Signed-off-by: Kefu Chai <kchai@redhat.com>
tools/ceph-objectstore-tool: add "set-osdmap" command
Fixes: #12193
Signed-off-by: Kefu Chai <kchai@redhat.com>
(cherry picked from commit 3e30c1746fb8d90b04e4776849069db0b7737c87)
Conflicts:
src/tools/ceph_objectstore_tool.cc (trivial)
History
#2 Updated by Jonas Weismüller over 8 years ago
- File ceph-osd.5.log.1.gz added
#3 Updated by Kefu Chai over 8 years ago
It seems the cached osdmap in the objectstore still contains the bad crush map. Until the OSD reaches "STATE_BOOTING", the fixed OSDMap messages are ignored. Meanwhile, the peering work queue hits the bad crush map and brings down the OSD daemon.
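The sequence described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not Ceph code: the class names `ToyOSD` and `OSDMap`, the `booting` flag, and the `BadCrushMap` exception (standing in for the segfault in `crush_do_rule()`) are all hypothetical and exist only to show why a corrected map sent by the monitors never reaches an OSD that is still starting up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class BadCrushMap(Exception):
    """Hypothetical stand-in for the segfault inside crush_do_rule()."""


@dataclass
class OSDMap:
    epoch: int
    crush_ok: bool  # False models a map carrying the faulty crushmap

    def pg_to_osds(self, pg: int) -> List[int]:
        # Placement goes through the crush map, so a bad map fails here.
        if not self.crush_ok:
            raise BadCrushMap(f"epoch {self.epoch}: broken crush rule")
        return [pg % 3, (pg + 1) % 3]


@dataclass
class ToyOSD:
    cached: OSDMap            # map loaded from the local objectstore
    booting: bool = False     # stays False while startup is in progress
    log: List[str] = field(default_factory=list)

    def handle_osdmap(self, new_map: OSDMap) -> None:
        # Maps from the monitors are dropped until startup has finished.
        if not self.booting:
            self.log.append("ignoring osdmap until we have initialized")
            return
        self.cached = new_map

    def process_peering_events(self, pgs: List[int]) -> Dict[int, List[int]]:
        # Peering runs against the cached map, whatever its state.
        return {pg: self.cached.pg_to_osds(pg) for pg in pgs}


def simulate_start(osd: ToyOSD, fixed_map: OSDMap, pgs: List[int]) -> str:
    osd.handle_osdmap(fixed_map)          # arrives too early, gets ignored
    try:
        osd.process_peering_events(pgs)   # uses the stale cached map
    except BadCrushMap:
        return "crashed"
    return "running"
```

In this model, an OSD whose cached map is bad crashes on every start regardless of what the monitors send, which is why the fix has to touch the map stored on the OSD itself.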
#4 Updated by Kefu Chai over 8 years ago
- Status changed from New to Fix Under Review
Add a command allowing the user to rewrite the osdmap in the OSD's objectstore:
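As a rough sketch of how the new "set-osdmap" operation might be invoked, the helper below only builds the command line and does not run anything. The helper name `set_osdmap_argv` and the file name `osdmap.fixed` are hypothetical; the data path is the one that appears in the log above. It assumes the OSD daemon is stopped and that a known-good map has already been saved to a file (for example with `ceph osd getmap -o <file>`). Check the flags against `ceph-objectstore-tool --help` for the installed version before use.

```python
import shlex
from typing import List, Optional


def set_osdmap_argv(data_path: str,
                    osdmap_file: str,
                    journal_path: Optional[str] = None) -> List[str]:
    """Build (but do not run) a ceph-objectstore-tool set-osdmap command."""
    argv = ["ceph-objectstore-tool", "--data-path", data_path]
    if journal_path:
        # Filestore OSDs may need the journal location passed explicitly.
        argv += ["--journal-path", journal_path]
    # Overwrite the osdmap stored in the objectstore with the given file.
    argv += ["--op", "set-osdmap", "--file", osdmap_file]
    return argv


if __name__ == "__main__":
    # "osdmap.fixed" is a hypothetical file holding a known-good map.
    print(shlex.join(set_osdmap_argv("/var/lib/ceph/osd/ceph-5", "osdmap.fixed")))
```

Building the argument list separately keeps the destructive step explicit: the printed command can be reviewed before anyone runs it against a stopped OSD.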
#5 Updated by Samuel Just over 8 years ago
- Tracker changed from Bug to Feature
- Target version set to v9.0.7
#6 Updated by Kefu Chai over 8 years ago
- Status changed from Fix Under Review to Resolved
We still need a ceph-monstore-tool command to extract the incremental map from the mon store.
#7 Updated by Loïc Dachary about 8 years ago
- Status changed from Resolved to Pending Backport
- Target version deleted (v9.0.7)
- Backport set to hammer
#8 Updated by Loïc Dachary about 8 years ago
- Copied to Backport #14894: hammer: OSD's are not updating osdmap properly after monitoring crash added
#9 Updated by Loïc Dachary about 8 years ago
- Status changed from Pending Backport to Resolved