Bug #330

closed

Crash on OSD::_share_map_outgoing(const entity_inst_t&)

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%


Description

When I upgraded to the latest unstable, all my OSDs (30 in total) crashed with the following message:

osd/OSD.cc: In function 'void OSD::_share_map_outgoing(const entity_inst_t&)':
osd/OSD.cc:1791: FAILED assert(inst.name.is_osd())
 1: (OSD::update_heartbeat_peers()+0x1d3f) [0x4da66f]
 2: (OSD::activate_map(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&)+0x8ee) [0x4db61e]
 3: (OSD::handle_osd_map(MOSDMap*)+0x233a) [0x4e506a]
 4: (OSD::_dispatch(Message*)+0x230) [0x4ef400]
 5: (OSD::ms_dispatch(Message*)+0x39) [0x4efe39]
 6: (SimpleMessenger::dispatch_entry()+0x749) [0x461fa9]
 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x458f6c]
 8: (Thread::_entry_func(void*)+0xa) [0x46cf1a]
 9: (()+0x69ca) [0x7f978d6fb9ca]
 10: (clone()+0x6d) [0x7f978c91b6cd]

I've uploaded a few logs and corefiles to logger.ceph.widodh.nl in the directory /srv/ceph/issues/cosd_crash_share_outgoing_map. Uploading all of the logs and corefiles seemed a bit pointless.

Before doing this upgrade I brought my whole cluster down because some other packages, such as the kernel, had to be upgraded, so all the OSDs were rebooted at the same time.

Actions #1

Updated by Wido den Hollander over 13 years ago

I got the cluster working again by starting it in the following order:

  • Kill the monitor and MDS
  • Start all the OSDs
  • Then start the monitors
  • Then start the MDS

After following that boot sequence the cluster got up and running again.
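
The boot sequence above can be sketched as a small script. This is only an illustration: `RESTART_SEQUENCE`, `run_sequence`, and the handler names are hypothetical, and the actual commands for stopping or starting daemons depend on how the cluster is deployed, so they are left as placeholders. The order of the two stop steps relative to each other is also an assumption, since the list only says both were killed first.

```python
from typing import Callable, Dict, List, Tuple

# Order reported in this comment: stop the monitor and MDS first,
# then start the OSDs, then the monitors, then the MDS.
# (Hypothetical names; the daemon labels are just strings here.)
RESTART_SEQUENCE: List[Tuple[str, str]] = [
    ("stop", "mon"),
    ("stop", "mds"),
    ("start", "osd"),
    ("start", "mon"),
    ("start", "mds"),
]


def run_sequence(
    handlers: Dict[str, Callable[[str], None]],
    sequence: List[Tuple[str, str]] = RESTART_SEQUENCE,
) -> List[str]:
    """Apply each (action, daemon) step in order and return a log of the steps."""
    log = []
    for action, daemon in sequence:
        # The handler is supplied by the caller; this sketch does not
        # assume any particular init script or command line.
        handlers[action](daemon)
        log.append(f"{action} {daemon}")
    return log


if __name__ == "__main__":
    # Dry run: the handlers only print what they would do.
    dry_run = {
        "stop": lambda daemon: print(f"would stop {daemon}"),
        "start": lambda daemon: print(f"would start {daemon}"),
    }
    run_sequence(dry_run)
```

Keeping the order as data makes it easy to do a dry run before swapping in handlers that actually stop and start the daemons.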

Actions #2

Updated by Sage Weil over 13 years ago

  • Status changed from New to Resolved

Actions #3

Updated by Wido den Hollander over 13 years ago

The commit did not work; my OSDs kept crashing.

I placed three new coredumps (timestamps preserved) in the same directory on logger.ceph.widodh.nl.

I manually reverted 9bfb8da9f925642bca46528a999124cd8b28ba2a and now the cluster is running again.
