Bug #330
Crash on OSD::_share_map_outgoing(const entity_inst_t&) (Closed)
Description
When upgrading to the latest unstable, all my OSDs (30 in total) crashed with the following message:
osd/OSD.cc: In function 'void OSD::_share_map_outgoing(const entity_inst_t&)':
osd/OSD.cc:1791: FAILED assert(inst.name.is_osd())
 1: (OSD::update_heartbeat_peers()+0x1d3f) [0x4da66f]
 2: (OSD::activate_map(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&)+0x8ee) [0x4db61e]
 3: (OSD::handle_osd_map(MOSDMap*)+0x233a) [0x4e506a]
 4: (OSD::_dispatch(Message*)+0x230) [0x4ef400]
 5: (OSD::ms_dispatch(Message*)+0x39) [0x4efe39]
 6: (SimpleMessenger::dispatch_entry()+0x749) [0x461fa9]
 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x458f6c]
 8: (Thread::_entry_func(void*)+0xa) [0x46cf1a]
 9: (()+0x69ca) [0x7f978d6fb9ca]
 10: (clone()+0x6d) [0x7f978c91b6cd]
I've uploaded a few logs and core files to logger.ceph.widodh.nl in the directory /srv/ceph/issues/cosd_crash_share_outgoing_map; it seemed a bit useless to upload all the logs and core files.
Before doing this upgrade I brought my whole cluster down because some other packages (including the kernel) had to be upgraded, so all the OSDs were rebooted at the same time.
Updated by Wido den Hollander over 13 years ago
I got the cluster working again by starting it in the following order:
- Kill the monitors and MDS
- Start all the OSDs
- Then start the monitors
- Then start the MDS
After following that boot sequence the cluster got up and running again.
Updated by Sage Weil over 13 years ago
- Status changed from New to Resolved
Updated by Wido den Hollander over 13 years ago
The commit did not fix it; my OSDs kept crashing.
I placed three new core dumps (timestamps preserved) in the same directory on logger.ceph.widodh.nl.
I manually reverted 9bfb8da9f925642bca46528a999124cd8b28ba2a and now the cluster is running again.
Updated by Sage Weil over 13 years ago
Fixed (more) by ef711e2eead039b9819b8380f7b1ea6ebd84160d