Feature #12193

OSDs are not updating osdmap properly after monitor crash

Added by Jonas Weismüller over 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
hammer
Reviewed:
Affected Versions:
Pull request ID:

Description

I recently had an issue where the whole cluster went down because of a monitor crash caused by a faulty crushmap injection; see #12047 for reference.

2015-06-30 13:43:00.878364 7f9feb23d840  0 osd.5 1547 load_pgs
2015-06-30 13:43:01.076863 7f9feb23d840  0 osd.5 1547 load_pgs opened 43 pgs
2015-06-30 13:43:01.077486 7f9feb23d840 -1 osd.5 1547 log_to_monitors {default=true}
2015-06-30 13:43:01.086179 7f9fd76bd700  0 osd.5 1547 ignoring osdmap until we have initialized
2015-06-30 13:43:01.089618 7f9fd76bd700  0 osd.5 1547 ignoring osdmap until we have initialized
2015-06-30 13:43:02.086065 7f9fccea8700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f9fccea8700

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: /usr/bin/ceph-osd() [0xbef08c]
 2: (()+0xf0a0) [0x7f9fea1500a0]
 3: /usr/bin/ceph-osd() [0xd5c934]
 4: (crush_do_rule()+0x390) [0xd5d570]
 5: (CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const+0x8b) [0xcb146b]
 6: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*, int*, unsigned int*) const+0x7c) [0xc9a68c]
 7: (OSDMap::_pg_to_up_acting_osds(pg_t const&, std::vector<int, std::allocator<int> >*, int*, std::vector<int, std::allocator<int> >*, int*) const+0x10f) [0xc9a82f]
 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x1e2) [0x7d1dc2]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x262) [0x7d2cd2]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x14) [0x8385c4]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xccce79]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xccee70]
 13: (()+0x6b50) [0x7f9fea147b50]
 14: (clone()+0x6d) [0x7f9fe8b6395d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -31> 2015-06-30 13:43:00.127913 7f9feb23d840  5 asok(0x50be000) register_command perfcounters_dump hook 0x5082050
   -30> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command 1 hook 0x5082050
   -29> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command perf dump hook 0x5082050
   -28> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command perfcounters_schema hook 0x5082050
   -27> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command 2 hook 0x5082050
   -26> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command perf schema hook 0x5082050
   -25> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command perf reset hook 0x5082050
   -24> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command config show hook 0x5082050
   -23> 2015-06-30 13:43:00.127998 7f9feb23d840  5 asok(0x50be000) register_command config set hook 0x5082050
   -22> 2015-06-30 13:43:00.128058 7f9feb23d840  5 asok(0x50be000) register_command config get hook 0x5082050
   -21> 2015-06-30 13:43:00.128060 7f9feb23d840  5 asok(0x50be000) register_command config diff hook 0x5082050
   -20> 2015-06-30 13:43:00.128067 7f9feb23d840  5 asok(0x50be000) register_command log flush hook 0x5082050
   -19> 2015-06-30 13:43:00.128067 7f9feb23d840  5 asok(0x50be000) register_command log dump hook 0x5082050
   -18> 2015-06-30 13:43:00.128085 7f9feb23d840  5 asok(0x50be000) register_command log reopen hook 0x5082050
   -17> 2015-06-30 13:43:00.130445 7f9feb23d840  0 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3), process ceph-osd, pid 2947
   -16> 2015-06-30 13:43:00.132297 7f9feb23d840  1 finished global_init_daemonize
   -15> 2015-06-30 13:43:00.152866 7f9feb23d840  0 filestore(/var/lib/ceph/osd/ceph-5) backend xfs (magic 0x58465342)
   -14> 2015-06-30 13:43:00.244827 7f9feb23d840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: FIEMAP ioctl is supported and appears to work
   -13> 2015-06-30 13:43:00.244853 7f9feb23d840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
   -12> 2015-06-30 13:43:00.604742 7f9feb23d840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: syscall(SYS_syncfs, fd) fully supported
   -11> 2015-06-30 13:43:00.604855 7f9feb23d840  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-5) detect_features: disabling extsize, kernel 3.2.0-4-amd64 is older than 3.5 and has buggy extsize ioctl
   -10> 2015-06-30 13:43:00.714798 7f9feb23d840  0 filestore(/var/lib/ceph/osd/ceph-5) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
    -9> 2015-06-30 13:43:00.869904 7f9feb23d840  0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
    -8> 2015-06-30 13:43:00.877680 7f9feb23d840  0 osd.5 1547 crush map has features 1107558400, adjusting msgr requires for clients
    -7> 2015-06-30 13:43:00.878332 7f9feb23d840  0 osd.5 1547 crush map has features 1107558400 was 8705, adjusting msgr requires for mons
    -6> 2015-06-30 13:43:00.878332 7f9feb23d840  0 osd.5 1547 crush map has features 1107558400, adjusting msgr requires for osds
    -5> 2015-06-30 13:43:00.878364 7f9feb23d840  0 osd.5 1547 load_pgs
    -4> 2015-06-30 13:43:01.076863 7f9feb23d840  0 osd.5 1547 load_pgs opened 43 pgs
    -3> 2015-06-30 13:43:01.077486 7f9feb23d840 -1 osd.5 1547 log_to_monitors {default=true}
    -2> 2015-06-30 13:43:01.086179 7f9fd76bd700  0 osd.5 1547 ignoring osdmap until we have initialized
    -1> 2015-06-30 13:43:01.089618 7f9fd76bd700  0 osd.5 1547 ignoring osdmap until we have initialized
     0> 2015-06-30 13:43:02.086065 7f9fccea8700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f9fccea8700
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: /usr/bin/ceph-osd() [0xbef08c]
 2: (()+0xf0a0) [0x7f9fea1500a0]
 3: /usr/bin/ceph-osd() [0xd5c934]
 4: (crush_do_rule()+0x390) [0xd5d570]
 5: (CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const+0x8b) [0xcb146b]
 6: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*, int*, unsigned int*) const+0x7c) [0xc9a68c]
 7: (OSDMap::_pg_to_up_acting_osds(pg_t const&, std::vector<int, std::allocator<int> >*, int*, std::vector<int, std::allocator<int> >*, int*) const+0x10f) [0xc9a82f]
 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x1e2) [0x7d1dc2]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x262) [0x7d2cd2]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x14) [0x8385c4]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xccce79]
 12: (ThreadPool::WorkThread::entry()+0x10) [0xccee70]
 13: (()+0x6b50) [0x7f9fea147b50]
 14: (clone()+0x6d) [0x7f9fe8b6395d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

If you need any further debugging information that is not already in the referenced ticket, let me know.

ceph-osd.5.log.bz2 (151 Bytes) Jonas Weismüller, 07/01/2015 07:12 AM

ceph-osd.5.log.1.gz (3.03 KB) Jonas Weismüller, 07/01/2015 07:51 AM


Related issues

Related to Ceph - Bug #12047: monitor segmentation fault on faulty crushmap Duplicate 06/17/2015
Copied to Ceph - Backport #14894: hammer: OSD's are not updating osdmap properly after monitoring crash Resolved

Associated revisions

Revision 3e30c174 (diff)
Added by Kefu Chai over 8 years ago

tools/ceph-objectstore-tool: add "set-osdmap" command

Fixes: #12193
Signed-off-by: Kefu Chai <>

Revision c60eee1d (diff)
Added by Kefu Chai about 8 years ago

tools/ceph-objectstore-tool: add "set-osdmap" command

Fixes: #12193
Signed-off-by: Kefu Chai <>
(cherry picked from commit 3e30c1746fb8d90b04e4776849069db0b7737c87)

Conflicts:
src/tools/ceph_objectstore_tool.cc (trivial)

History

#1 Updated by Jonas Weismüller over 8 years ago

uploading log file

#3 Updated by Kefu Chai over 8 years ago

It seems the osdmap cached in the OSD's objectstore still contains the bad crush map. Until the OSD reaches "STATE_BOOTING", it ignores incoming OSDMap messages, including the ones that carry the fixed map. Meanwhile, the peering work queue runs against the cached bad crush map and brings down the OSD daemon.
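The sequence described above can be illustrated with a toy model. This is only a sketch in Python, not Ceph code: the class names, the `STATE_INITIALIZING` label, the `crush_ok` flag, and epoch 1548 are assumptions made up for illustration, and the exception stands in for the segfault seen in crush_do_rule().

```python
from dataclasses import dataclass, field


class BadCrushMap(Exception):
    """Stands in for the segfault raised inside crush_do_rule()."""


@dataclass
class OSDMap:
    epoch: int
    crush_ok: bool  # hypothetical flag: False models the faulty crush map

    def pg_to_up_acting_osds(self, pg):
        # Mapping a PG walks the crush map; a bad map blows up here.
        if not self.crush_ok:
            raise BadCrushMap(f"epoch {self.epoch}: bad crush map")
        return [0, 1, 2]


@dataclass
class ToyOSD:
    cached: OSDMap                      # map loaded from the objectstore
    state: str = "STATE_INITIALIZING"   # illustrative pre-boot state
    log: list = field(default_factory=list)

    def handle_osd_map(self, new_map):
        # Before boot, incoming maps are dropped, even fixed ones.
        if self.state == "STATE_INITIALIZING":
            self.log.append("ignoring osdmap until we have initialized")
            return
        self.cached = new_map

    def process_peering_events(self, pgs):
        # Peering uses whatever map is cached, here still the bad one.
        for pg in pgs:
            self.cached.pg_to_up_acting_osds(pg)


if __name__ == "__main__":
    osd = ToyOSD(cached=OSDMap(epoch=1547, crush_ok=False))
    osd.handle_osd_map(OSDMap(epoch=1548, crush_ok=True))  # fixed map, ignored
    try:
        osd.process_peering_events(["pg-a"])
    except BadCrushMap as exc:
        print("OSD would crash:", exc)
```

In this model the only way out is to replace the cached map before the daemon starts, since the running daemon never gets far enough to accept the fixed one.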

#4 Updated by Kefu Chai over 8 years ago

  • Status changed from New to Fix Under Review

Added a command that lets the user rewrite the osdmap in an OSD's objectstore:

https://github.com/ceph/ceph/pull/5127
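As a rough sketch of how the new command might be invoked, the Python helper below only assembles the argument list for `ceph-objectstore-tool --op set-osdmap`; it does not run anything. The helper name, the journal location, and the map file path are assumptions for illustration. The data path follows the /var/lib/ceph/osd/ceph-5 layout seen in the log above. Check the exact flags against `ceph-objectstore-tool --help` on the installed version, and stop the OSD daemon before touching its store.

```python
import shlex


def set_osdmap_argv(osd_id, map_file, data_root="/var/lib/ceph/osd"):
    """Build the argv for rewriting the osdmap stored in an OSD's objectstore.

    map_file is assumed to hold a full, known-good osdmap; how that file
    is obtained is outside the scope of this sketch.
    """
    data_path = f"{data_root}/ceph-{osd_id}"
    return [
        "ceph-objectstore-tool",
        "--data-path", data_path,
        # Assumption: the filestore journal sits at the default location.
        "--journal-path", f"{data_path}/journal",
        "--op", "set-osdmap",
        "--file", map_file,
    ]


if __name__ == "__main__":
    argv = set_osdmap_argv(5, "/tmp/osdmap.good")  # hypothetical file name
    # Print a copy-pasteable command line instead of executing it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Building the list without executing it keeps the sketch safe to run anywhere; the printed line can be reviewed and then run by hand on the affected host.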

#5 Updated by Samuel Just over 8 years ago

  • Tracker changed from Bug to Feature
  • Target version set to v9.0.7

#6 Updated by Kefu Chai over 8 years ago

  • Status changed from Fix Under Review to Resolved

We still need a ceph-monstore-tool command to extract the incremental map from the mon store.

#7 Updated by Loïc Dachary about 8 years ago

  • Status changed from Resolved to Pending Backport
  • Target version deleted (v9.0.7)
  • Backport set to hammer

#8 Updated by Loïc Dachary about 8 years ago

  • Copied to Backport #14894: hammer: OSD's are not updating osdmap properly after monitoring crash added

#9 Updated by Loïc Dachary about 8 years ago

  • Status changed from Pending Backport to Resolved
