Bug #517

monitors crashing on startup after injecting corrupt crush map

Added by John Leach almost 9 years ago. Updated almost 9 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Monitor
Target version:
Start date: 10/24/2010
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

I followed the instructions at http://ceph.newdream.net/wiki/OSD_cluster_expansion/contraction to add a third osd node to my existing two-node cluster, but forgot to re-encode the crushmap before injecting it, so I accidentally injected the decoded (text) map.

When I injected it from the new osd node, the monitors on the two other nodes crashed. Neither of them will start up again now; both output the same stack trace.

root@srv-ohkpf:/root# ceph osd setcrushmap -i /tmp/crush.txt
read 850 bytes from /tmp/crush.txt
2010-10-24 13:09:17.167566 mon <- [osd,setcrushmap]
2010-10-24 13:09:18.189259 7f0aea70b710 monclient: hunting for new mon
2010-10-24 13:09:18.190237 7f0ae9608710 -- 10.61.136.222:0/8447 >> 10.135.211.78:6789/0 pipe(0xa698a0 sd=-1 pgs=0 cs=0 l=0).fault first fault
2010-10-24 13:09:20.166031 7f0ae9507710 -- 10.61.136.222:0/8447 >> 10.106.124.118:6789/0 pipe(0xa67420 sd=-1 pgs=0 cs=0 l=0).fault first fault
2010-10-24 13:09:23.166973 7f0ae9406710 -- 10.61.136.222:0/8447 >> 10.135.211.78:6789/0 pipe(0xa67b60 sd=-1 pgs=0 cs=0 l=0).fault first fault

and from the mon log file:

2010-10-24 13:09:17.167146 7f9482418710 mon.0@0(leader) e1 handle_command mon_command(osd setcrushmap v 0) v1
./crush/CrushWrapper.h: In function 'void CrushWrapper::decode(ceph::buffer::list::iterator&)':
./crush/CrushWrapper.h:437: FAILED assert(magic == 0x00010000ul)
 ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
 1: (OSDMap::apply_incremental(OSDMap::Incremental&)+0x12d8) [0x4a9c58]
 2: (OSDMonitor::update_from_paxos()+0xf1) [0x4945d1]
 3: (PaxosService::_commit()+0x25) [0x48a535]
 4: (finish_contexts(std::list<Context*, std::allocator<Context*> >&, int)+0x1b1) [0x486d01]
 5: (Paxos::handle_accept(MMonPaxos*)+0x39e) [0x482fce]
 6: (Paxos::dispatch(PaxosServiceMessage*)+0x1b3) [0x485c13]
 7: (Monitor::_ms_dispatch(Message*)+0x8e0) [0x472760]
 8: (Monitor::ms_dispatch(Message*)+0x67) [0x47f2e7]
 9: (SimpleMessenger::dispatch_entry()+0x79b) [0x45ac8b]
 10: (SimpleMessenger::DispatchThread::entry()+0x1f) [0x44c35f]
 11: (Thread::_entry_func(void*)+0xa) [0x46165a]
 12: (()+0x69ca) [0x7f94846fd9ca]
 13: (clone()+0x6d) [0x7f948391c70d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
 1: (sigabrt_handler(int)+0xde) [0x557d4e]
 2: (()+0x33af0) [0x7f9483869af0]
 3: (gsignal()+0x35) [0x7f9483869a75]
 4: (abort()+0x180) [0x7f948386d5c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f948411f8e5]
 6: (()+0xcad16) [0x7f948411dd16]
 7: (()+0xcad43) [0x7f948411dd43]
 8: (()+0xcae3e) [0x7f948411de3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x69c) [0x542b3c]
 10: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x744) [0x4a6304]
 11: (OSDMap::apply_incremental(OSDMap::Incremental&)+0x12d8) [0x4a9c58]
 12: (OSDMonitor::update_from_paxos()+0xf1) [0x4945d1]
 13: (PaxosService::_commit()+0x25) [0x48a535]
 14: (finish_contexts(std::list<Context*, std::allocator<Context*> >&, int)+0x1b1) [0x486d01]
 15: (Paxos::handle_accept(MMonPaxos*)+0x39e) [0x482fce]
 16: (Paxos::dispatch(PaxosServiceMessage*)+0x1b3) [0x485c13]
 17: (Monitor::_ms_dispatch(Message*)+0x8e0) [0x472760]
 18: (Monitor::ms_dispatch(Message*)+0x67) [0x47f2e7]
 19: (SimpleMessenger::dispatch_entry()+0x79b) [0x45ac8b]
 20: (SimpleMessenger::DispatchThread::entry()+0x1f) [0x44c35f]
 21: (Thread::_entry_func(void*)+0xa) [0x46165a]

History

#1 Updated by Sage Weil almost 9 years ago

  • Category set to Monitor
  • Assignee set to Colin McCabe
  • Target version set to v0.23

We need to decode the provided map in a try {} block to verify that it is valid before using it, probably in OSDMonitor::prepare_command().
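A minimal sketch of that validation step, not the actual Ceph fix. It assumes a decoder that throws on bad input instead of asserting. The names decode_crush_map, validate_crush_map and kCrushMagic are hypothetical stand-ins; only the magic value 0x00010000 comes from the failed assert above, and reading it as a little-endian 32-bit value is an assumption of this sketch.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <stdexcept>
#include <string>

// Magic value taken from the failed assert in CrushWrapper::decode().
constexpr std::uint32_t kCrushMagic = 0x00010000ul;

// Hypothetical stand-in for CrushWrapper::decode(): it reads only the
// leading 32-bit magic (assumed little-endian) and throws instead of
// asserting when the input is not an encoded crush map.
void decode_crush_map(const std::string& buf) {
    if (buf.size() < 4) {
        throw std::runtime_error("crush map too short");
    }
    std::uint32_t magic = 0;
    for (int i = 0; i < 4; ++i) {
        magic |= static_cast<std::uint32_t>(
                     static_cast<unsigned char>(buf[i])) << (8 * i);
    }
    if (magic != kCrushMagic) {
        throw std::runtime_error("bad crush map magic");
    }
    // A real decoder would go on to parse buckets, rules, and so on.
}

// Hypothetical check to run while preparing the command: returns 0 if
// the map decodes and -EINVAL otherwise, so a bad map can be rejected
// before it is ever proposed to the other monitors.
int validate_crush_map(const std::string& buf, std::string& err) {
    try {
        decode_crush_map(buf);
    } catch (const std::exception& e) {
        err = e.what();
        return -EINVAL;
    }
    return 0;
}
```

With a check like this, a text map passed to setcrushmap would produce an error reply to the client rather than an incremental that every monitor asserts on while applying it.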
